

When  are  Overcomplete  Representations  Identifiable? 
Uniqueness  of  Tensor  Decompositions  Under  Expansion  Constraints 

Animashree  Anandkumar,  Daniel  Hsu,  Majid  Janzamin  and  Sham  Kakade* 

June  16,  2013 


Abstract 

Overcomplete latent representations have been very popular for unsupervised feature learning
in recent years. In this paper, we specify which overcomplete models can be identified given
observable moments of a certain order. We consider probabilistic admixture or topic models
in the overcomplete regime. While general overcomplete admixtures are not identifiable, we
establish generic identifiability under a constraint, referred to as topic persistence. Our
sufficient conditions for identifiability involve a novel set of expansion conditions on the
population structure (i.e., the topic-word matrix) of the persistent topic model. Specifically,
we require the existence of a perfect matching from hidden variables to higher order observed
variables, and can thus incorporate overcomplete models. In particular, we establish that random
models are identifiable w.h.p. in the overcomplete regime. Moreover, our analysis allows for
general (non-degenerate) distributions for modeling the topic proportions. Our proof techniques
incorporate a novel class of tensor decompositions which fall in between the well-known
candecomp/parafac (CP) and Tucker decompositions, and provide novel conditions for unique
tensor decomposition.

Keywords: Overcomplete representation, admixture models, generic identifiability, tensor
decomposition.


1  Introduction 


The performance of many machine learning methods is hugely dependent on the choice of data
representations or features. Overcomplete representations, where the number of features can be
greater than the dimensionality of the input data, have been extensively employed, and are
arguably critical in a number of applications such as speech and computer vision [1]. Overcomplete
representations are known to be more robust to noise, and can provide greater flexibility in
modeling [2]. Unsupervised estimation of overcomplete representations has been hugely popular due
to the availability of large-scale unlabeled samples in many applications.

A probabilistic framework for incorporating features often posits latent or hidden variables that
can provide a good explanation of the observed data. Overcomplete probabilistic models can have
a latent space dimensionality that far exceeds the observed dimensionality. In this paper,
we characterize the conditions under which overcomplete latent variable models can be identified
from their observed moments.

*A. Anandkumar and M. Janzamin are with the Center for Pervasive Communications and Computing,
Electrical Engineering and Computer Science Dept., University of California, Irvine, USA 92697. Email:
a.anandkumar@uci.edu, mjanzami@uci.edu. Daniel Hsu and Sham Kakade are with Microsoft Research New England,
1 Memorial Drive, Cambridge, MA 02142. Email: dahsu@microsoft.com, skakade@microsoft.com

For any parametric statistical model, identifiability is a fundamental question of whether the model
parameters can be uniquely recovered given the observed statistics. Identifiability is crucial in a
number of applications where the latent variables are the quantities of interest, e.g., inferring
diseases (latent variables) through symptoms (observations), inferring communities (latent variables)
via the interactions of the actors in a social network (observations), and so on. Moreover,
identifiability can be relevant even in predictive settings, where feature learning is employed for
some higher level task such as classification. For instance, non-identifiability can lead to the
presence of non-isolated local optima for optimization-based learning methods, which can affect their
convergence properties, e.g., see [3].

In  this  paper,  we  characterize  identifiability  for  a  popular  class  of  latent  variable  models,  known 
as  the  admixture  or  topic  models  [4,5].  These  are  hierarchical  mixture  models,  which  incorporate 
the  presence  of  multiple  latent  states  (i.e.  topics)  in  documents  consisting  of  a  tuple  of  observed 
variables  (i.e.  words).  In  this  paper,  we  characterize  conditions  under  which  the  topic  models  are 
identified  through  their  observed  moments  in  the  overcomplete  regime.  To  this  end,  we  introduce 
an  additional  constraint  on  the  model,  referred  to  as  topic  persistence.  Intuitively,  this  captures 
the  “locality”  effect  among  the  observed  words,  and  goes  beyond  the  usual  “bag-of-words”  or 
exchangeable  topic  models.  Such  local  dependencies  among  observations  abound  in  applications 
such  as  text,  images  and  speech,  and  can  lead  to  more  faithful  representation.  In  addition,  we 
establish  that  the  presence  of  topic  persistence  is  central  to  obtaining  model  identifiability  in  the 
overcomplete  regime,  and  we  provide  an  in-depth  analysis  of  this  phenomenon  in  this  paper. 


1.1  Summary  of  results 

In this paper, we provide conditions for generic¹ model identifiability of overcomplete topic models
given observable moments of a certain order (i.e., a certain number of words in each document). We
introduce a novel constraint, referred to as topic persistence, and analyze its effect on identifiability.
We establish identifiability in the presence of a novel combinatorial object, referred to as perfect
n-gram matching, in the bipartite graph from topics to words (observed variables). Finally, we
prove that random models satisfy these criteria, and are thus identifiable in the overcomplete
regime.

We first introduce the n-persistent topic model, where the parameter n determines the so-called
persistence level of a common topic in a sequence of n successive words, as seen in Figure 1. The
n-persistent model reduces to the popular "bag-of-words" model when $n = 1$, and to the single
topic model (i.e., only one topic in each document) when $n \to \infty$. Intuitively, topic persistence aids
identifiability since we have multiple views of the common hidden topic generating a sequence of
successive words. We establish that the bag-of-words model (with $n = 1$) is too non-informative
about the topics to be identifiable in the overcomplete regime. On the other hand, n-persistent
overcomplete topic models with $n \ge 2$ are generically identifiable, and we provide a set of transparent
conditions for identifiability.

¹ A model is generically identifiable if all the parameters in the parameter space are identifiable, almost surely.
Refer to Definition 1 for more discussion.

Figure 1: Hierarchical structure of the n-persistent topic model. $2rn$ words (views) are
shown for some integer $r \ge 1$. A single topic $y_j$, $j \in [2r]$, is chosen for each $n$ successive views
$\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$.

Our sufficient conditions for identifiability are in the form of expansion conditions from the latent
topic space to the observed word space. In the overcomplete regime, there are more topics than
words, and thus it is impossible to have expansion from topics to words. Instead, we impose a novel
expansion constraint from topics to "higher order" words, which allows us to handle overcomplete
models. We establish that this condition translates to the presence of a novel combinatorial object,
referred to as perfect n-gram matching, on the bipartite graph from topics to words, which encodes
the sparsity pattern of the topic-word matrix. Intuitively, this condition implies "diversity" of the
word support for different topics, which leads to identifiability. In addition, we present trade-offs
between the topic and word space dimensionality, topic persistence level, the order of the observed
moments at hand, the maximum degree of any topic in the bipartite graph, and the Kruskal rank [6]
of the topic-word matrix, for identifiability to hold. We also show that $\ell_1$-based optimization can
efficiently recover the model under some additional conditions.

We then explicitly characterize the regime of identifiability for the random case, where each topic
is randomly supported on a set of d words, i.e., the bipartite graph is a random d-regular graph.
For this d-random model with q topics, p-dimensional word vocabulary, and topic persistence level
n, when $q = O(p^n)$ and $\Theta(\log p) \le d \le \Theta(p^{1/n})$, the topic-word matrix is identifiable from
$(2n)$-th order observed moments with high probability. Furthermore, we establish that the size condition
$q = O(p^n)$ is tight for identifiability. Thus, we prove that random-structured topic models are
identifiable in the overcomplete regime.

To the best of our knowledge, this is the first work to provide expansion-based conditions for
characterizing identifiability of overcomplete admixture models. We prove these results by
characterizing the tensor algebra underlying the observed moments of the topic model. We establish that
model identifiability for persistent topic models reduces to establishing uniqueness for a new class
of tensor decompositions. For the special case of the bag-of-words model (with persistence level
1), this tensor decomposition reduces to the Tucker decomposition [7], while for the single topic
model (with infinite persistence), it reduces to the candecomp/parafac (CP) decomposition. Thus,
our in-depth analysis provides novel identifiability results for overcomplete tensor decomposition
under expansion conditions.

1.2 Related works


Identifiability, learning and applications of overcomplete latent representations: Many
recent works employ unsupervised estimation of overcomplete features for higher level tasks such
as classification, e.g., [1,8-10], and record huge gains over other approaches in a number of applications
such as speech recognition and computer vision. However, theoretical understanding regarding
learnability or identifiability of overcomplete representations is far more limited.

Overcomplete latent representations have been analyzed in the context of independent
component analysis (ICA), where the sources are assumed to be independent, and the mixing matrix
is unknown. In the overcomplete or under-determined regime of ICA, there are more sources
than sensors. Identifiability and learning of overcomplete ICA (through the analysis of the
resulting overcomplete CP tensor decomposition) has been considered in [11-14]. However, their
results are not directly applicable to admixture models, since admixture models result in tensor
decompositions which are more general than the CP decomposition used in these works. Moreover, we
explicitly characterize the effect of the sparsity pattern of the mixing matrix (i.e., the topic-word
matrix) on model identifiability, while the above works assume fully dense (generic) mixing matrices.

There are a number of works which analyze conditions for generic identifiability of a variety of
overcomplete latent variable models such as the phylogenetic tree models [15, 16]. These results
provide conditions for strict identifiability of the model, and here, the dimensionality of the latent
space has to be of the same order as the observed space dimensionality. In contrast, we use the
weaker notion of generic identifiability and can therefore allow the latent space dimensionality
to scale polynomially in the observed space dimensionality. The above works primarily utilize
Kruskal's result on uniqueness of tensor CP decompositions [6,17] to derive the identifiability results.
Recently, an extension of these identifiability results to the robust setting has been considered in [18].
A number of recent works also analyze generic identifiability of overcomplete tensors [19-21] by
utilizing tools from algebraic geometry. For a general overview of the algebraic geometry behind
tensor decompositions, see [7].


Identifiability  and  learning  of  undercomplete/over-determined  latent  representations: 

Many of the theoretical results on identifiability and learning of latent variable models are
limited to non-singular models, which prevents the latent space dimensionality from exceeding the
dimensionality of the observed space.

The works of Anandkumar et al. [22-24] provide an efficient moment-based approach for learning
topic models, under constraints on the distribution of the topic proportions, e.g., the single
topic model or the popular latent Dirichlet allocation (LDA). However, these works cannot handle
general admixture models, where the distribution of the topic proportions is not limited to these
classes. In addition, the approach can handle a variety of latent variable models such as Gaussian
mixtures, hidden Markov models (HMM) and community models [25]. The use of simultaneous
diagonalization based approaches for learning HMMs has been considered in a number of earlier
works, e.g., [26,27].

Our work is closely related to the work of Anandkumar et al. [28], which considers identifiability
and learning of topic models under expansion conditions on the topic-word matrix. The work of
Spielman et al. [29] considers a similar model in the context of dictionary learning, but in addition
assumes that the coefficient matrix is random. However, these works [28, 29] can handle only the
undercomplete setting, where the number of topics is less than the dimensionality of the word
vocabulary. We extend these results to the overcomplete setting by proposing novel higher order
expansion conditions on the topic-word matrix.


Dictionary learning/sparse coding: Overcomplete representations have been very popular
in the context of dictionary learning or sparse coding. Here, the task is to jointly learn a dictionary
as well as a sparse selection of the dictionary atoms to fit the observed data. There have been
Bayesian as well as frequentist approaches for dictionary learning [2, 30, 31]. However, the
heuristics employed in these works have no performance guarantees. The work of Spielman et al. [29]
considers learning undercomplete dictionaries and provides guaranteed learning under the
assumption that the coefficient matrix is random (distributed as Bernoulli-Gaussian variables). Recent
works [32, 33] provide generalization bounds for predictive sparse coding, where the goal of the
learned representation is to obtain good performance on some predictive task. This differs from
our framework since we do not consider predictive tasks here, but the question of recovering the
underlying latent representation.


2  Model 


Notation: The set $\{1, 2, \ldots, n\}$ is denoted by $[n] := \{1, 2, \ldots, n\}$. Given set $X = \{1, \ldots, p\}$, set
$X^{(n)}$ denotes all ordered n-tuples generated from $X$. The cardinality of set $S$ is denoted by $|S|$.
For any vector $u$ (or matrix $U$), the support denoted by $\mathrm{Supp}(u)$ corresponds to the location of its
non-zero entries, and the $\ell_0$ norm denoted by $\|u\|_0$ corresponds to the number of non-zero entries
of $u$, i.e., $\|u\|_0 := |\mathrm{Supp}(u)|$. For a vector $u \in \mathbb{R}^q$, $\mathrm{Diag}(u) \in \mathbb{R}^{q \times q}$ is the diagonal matrix with $u$
on its main diagonal. The column space of a matrix $A$ is denoted by $\mathrm{Col}(A)$. We refer to matrix
$R \in \mathbb{R}^{m \times r}$ as a square root of matrix $M \in \mathbb{R}^{m \times m}$ if $R R^\top = M$.

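For $A \in \mathbb{R}^{p \times q}$ and $B \in \mathbb{R}^{m \times n}$, the Kronecker product $A \otimes B \in \mathbb{R}^{pm \times qn}$ is defined as [34]

$$
A \otimes B =
\begin{bmatrix}
a_{11} B & a_{12} B & \cdots & a_{1q} B \\
a_{21} B & a_{22} B & \cdots & a_{2q} B \\
\vdots   & \vdots   & \ddots & \vdots   \\
a_{p1} B & a_{p2} B & \cdots & a_{pq} B
\end{bmatrix},
$$

and for $A = [a_1 | a_2 | \cdots | a_r] \in \mathbb{R}^{p \times r}$ and $B = [b_1 | b_2 | \cdots | b_r] \in \mathbb{R}^{m \times r}$, the Khatri-Rao product
$A \odot B \in \mathbb{R}^{pm \times r}$ is defined as

$$
A \odot B = [a_1 \otimes b_1 \,|\, a_2 \otimes b_2 \,|\, \cdots \,|\, a_r \otimes b_r].
$$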
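The two products can be checked numerically on small matrices. The following is a minimal NumPy sketch, not part of the paper: `khatri_rao` is a helper name introduced here for illustration, and the matrices `A` and `B` are arbitrary toy values.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (p x r) and B (m x r); result is (p*m) x r."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    # Column k of the result is the Kronecker product of column k of A and column k of B.
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])], axis=1)

A = np.array([[1.0, 2.0], [3.0, 4.0]])               # p = 2 rows, 2 columns
B = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 5.0]])   # m = 3 rows, 2 columns

K = np.kron(A, B)       # Kronecker product, shape (2*3, 2*2) = (6, 4)
KR = khatri_rao(A, B)   # Khatri-Rao product, shape (2*3, 2) = (6, 2)
```

Each column of the Khatri-Rao product is one particular column of the Kronecker product: column k of `KR` equals column `k * B.shape[1] + k` of `K`.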

2.1  Persistent  topic  model 

In this section, the n-persistent topic model is introduced, which imposes an additional constraint,
known as topic persistence, on the popular admixture model [4,5,35]. The n-persistent topic model
reduces to the bag-of-words admixture model in the case of $n = 1$.

An admixture model specifies a q-dimensional vector of topic proportions $h \in \Delta^{q-1} := \{u \in \mathbb{R}^q :
u_i \ge 0, \sum_{i=1}^{q} u_i = 1\}$ which generates the observed variables $x_l \in \mathbb{R}^p$ through vectors
$a_1, \ldots, a_q \in \mathbb{R}^p$. This collection of vectors $a_i$, $i \in [q]$, is referred to as the population structure
or topic-word matrix [35]. For instance, $a_i$ represents the conditional distribution of words given topic $i$. The
latent variable $h$ is a q-dimensional random vector $h := [h_1, \ldots, h_q]^\top$ known as the proportion vector. A
prior distribution $P(h)$ over the probability simplex $\Delta^{q-1}$ characterizes the prior joint distribution
over the latent variables $h_i$, $i \in [q]$. In topic modeling, this is the prior distribution over the
$q$ distinct topics.

The structure of the n-persistent topic model has a three-level multi-view hierarchy, shown in Figure 1.
$2rn$ words (views) are shown in the model for some integer $r \ge 1$. In this model, a
common hidden topic is persistent for a sequence of $n$ words $\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$, $j \in [2r]$.
Note that the random observed variables (words) are exchangeable within groups of size $n$, which
is the persistence level, but are not globally exchangeable.

We now describe a linear representation for the n-persistent topic model along the lines of [24], but with
extensions to incorporate persistence. Each random variable $y_j$, $j \in [2r]$, is a discrete-valued random
variable taking one of $q$ different values $\{1, \ldots, q\}$, i.e., $y_j \in [q]$ for $j \in [2r]$. In topic
modeling, a single common topic is chosen for a sequence of $n$ words $\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$,
$j \in [2r]$. For notational purposes, we equivalently assume that variables $y_j$, $j \in [2r]$, are encoded by
the basis vectors $e_i$, $i \in [q]$, where $e_i$ is the $i$-th basis vector in $\mathbb{R}^q$ with the $i$-th entry equal to 1
and all the others equal to zero. Thus, the variable $y_j$, $j \in [2r]$, can be interpreted as

$$
y_j = e_i \in \mathbb{R}^q \iff \text{the topic of the } j\text{-th group of words is } i.
$$

Given proportion vector $h$, topics $y_j$, $j \in [2r]$, are independently drawn according to the conditional
expectation

$$
\mathbb{E}[y_j \mid h] = h, \quad j \in [2r],
$$

or equivalently $\Pr[y_j = e_i \mid h] = h_i$, $j \in [2r]$, $i \in [q]$. Note that for each sequence of $n$ observed
variables, the same hidden variable $y_j$ is assumed in the n-persistent topic model, i.e., the topic is
persistent for $n$ different views.

Finally, at the bottom layer, each observed variable $x_l$, for $l \in [2rn]$, is a discrete-valued
p-dimensional random variable (word), where $p$ is the size of the vocabulary. Again, we assume that
variables $x_l$ are encoded by the basis vectors $e_k$, $k \in [p]$, such that

$$
x_l = e_k \in \mathbb{R}^p \iff \text{the } l\text{-th word in the document is } k.
$$

Given the corresponding topic $y_j$, $j \in [2r]$, words $x_l$, $l \in [2rn]$, are independently drawn according
to the conditional expectation

$$
\mathbb{E}\left[x_{(j-1)n+k} \mid y_j = e_i\right] = a_i, \quad i \in [q],\ j \in [2r],\ k \in [n], \qquad (1)
$$

where vectors $a_i \in \mathbb{R}^p$, $i \in [q]$, are the conditional probability distribution vectors. The matrix
$A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$ collecting these vectors is called the population structure or topic-word
matrix.
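The three-level generative process described above can be sketched in a few lines of NumPy. This is an illustrative sampler only: the function name `sample_document`, the toy sizes, and the Dirichlet choices for $h$ and for the columns of $A$ are assumptions made here (the model itself allows general non-degenerate distributions over the simplex). Indices are 0-based in the code.

```python
import numpy as np

def sample_document(A, h, n, r, rng):
    """Draw 2*r*n word indices from an n-persistent topic model.

    A : (p, q) topic-word matrix whose columns are probability vectors.
    h : (q,) topic proportion vector on the simplex.
    """
    p, q = A.shape
    topics = rng.choice(q, size=2 * r, p=h)   # y_j drawn with Pr[y_j = i | h] = h_i
    words = np.empty(2 * r * n, dtype=int)
    for j, topic in enumerate(topics):
        # The same topic generates all n words of group j (topic persistence).
        words[j * n:(j + 1) * n] = rng.choice(p, size=n, p=A[:, topic])
    return topics, words

rng = np.random.default_rng(0)
p, q, n, r = 4, 6, 2, 1                       # toy overcomplete sizes: q > p
A = rng.dirichlet(np.ones(p), size=q).T       # columns sum to one (illustrative choice)
h = rng.dirichlet(np.ones(q))                 # Dirichlet draw is only an example prior
topics, words = sample_document(A, h, n, r, rng)
```

Setting `n = 1` gives one topic draw per word, which matches the bag-of-words case mentioned in the text.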
The $(2rn)$-th order moment of observed variables $x_l$, $l \in [2rn]$, for some integer $r \ge 1$, is defined as
(in matrix form)²

$$
M_{2rn}(x) := \mathbb{E}\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\right] \in \mathbb{R}^{p^{rn} \times p^{rn}}. \qquad (2)
$$

For the n-persistent topic model with $2rn$ observations (words) $x_l$, $l \in [2rn]$, the corresponding
moment is denoted by $M^{(n)}_{2rn}(x)$.
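For one-hot encoded words, the matrix-form moment in equation (2) can be estimated by counting. The sketch below is an illustration under assumptions made here: documents are given as 0-based word indices, the helper name `empirical_moment` is hypothetical, and the expectation is replaced by a sample average. It relies on the fact that the Kronecker product of basis vectors is again a basis vector whose index is the base-$p$ number formed by the word indices.

```python
import numpy as np

def empirical_moment(docs, p):
    """Sample average of (x_1 ⊗ ... ⊗ x_rn)(x_{rn+1} ⊗ ... ⊗ x_{2rn})^T.

    docs : (N, 2*r*n) integer array of word indices in {0, ..., p-1}.
    Returns a (p**(r*n), p**(r*n)) matrix.
    """
    docs = np.asarray(docs)
    num_docs, length = docs.shape
    if length % 2 != 0:
        raise ValueError("each document must contain an even number of words")
    half = length // 2
    dim = p ** half
    # Index of e_{w1} ⊗ ... ⊗ e_{wk}: base-p number with w1 as the most significant digit.
    weights = p ** np.arange(half - 1, -1, -1)
    rows = docs[:, :half] @ weights
    cols = docs[:, half:] @ weights
    M = np.zeros((dim, dim))
    np.add.at(M, (rows, cols), 1.0)   # accumulate counts, including repeated (row, col) pairs
    return M / num_docs

docs = np.array([[0, 1, 2, 3], [0, 1, 2, 3], [3, 3, 0, 0]])   # three toy documents, 2rn = 4
M_hat = empirical_moment(docs, p=4)                            # shape (16, 16)
```

The matrix has $p^{rn}$ rows and columns, so this dense representation is only practical for small vocabularies and low orders.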

The moment characterization of the n-persistent topic model is provided in Lemma 1 in
Section 4.1. Given $M^{(n)}_{2rn}(x)$, what are the sufficient conditions under which the population structure
$A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$ is identifiable? This is answered in Section 3.

Remark 1. Note that we can alternatively introduce the linear generative model $x_l = A y_j$ (more
precisely, $x_{(j-1)n+k} = A y_j$, $j \in [2r]$, $k \in [n]$) instead of the conditional probabilistic model proposed
above. In this new model, each column of matrix $A$ does not need to be a valid probability
distribution. Furthermore, the observed random variables $x_l$ can be continuous, while the hidden ones $y_j$
should still be discrete. It is crucial to notice that the derivation of moments, mentioned later in
Section 4.1, still holds for this new model since each vector $y_j$, $j \in [2r]$, takes the basis vectors as
its values. Hence, the proposed identifiability results in Section 3 also hold.


3  Sufficient  Conditions  for  Generic  Identifiability 

In this section, the identifiability result for the n-persistent topic model with access to the (2rn)-th
order moment of words is provided. For an n-persistent topic model, it suffices to have only the
(2n)-th order moment (the r = 1 case) of words in order to uniquely recover the population
structure A under the proposed sufficient conditions.

The identifiability conditions and results under deterministic and random cases are provided in
this section. First, sufficient deterministic conditions on the population structure A are provided,
which lead to the identifiability result in Theorem 1. Next, building on the deterministic analysis, the
identifiability result for a random model is provided in Theorem 2 under reasonable size and degree
conditions on the bipartite graph which encodes the sparsity pattern of A.

We  make  the  notion  of  identifiability  precise.  As  defined  in  literature,  (strict)  identifiability  means 
that  the  population  structure  A  can  be  uniquely  recovered  up  to  permutation  of  the  columns 
for  all  valid  A.  Instead,  we  consider  a  more  relaxed  notion  of  identifiability,  known  as  generic 
identifiability. 

Definition 1 (Generic identifiability, [16]). Assume that the population structure A is generic,
which means that the sparsity pattern of A is fixed and then the nonzero entries are drawn from
a distribution (over those entries) that is absolutely continuous with respect to Lebesgue measure³.
The generic population structure (parameters) A is generically identifiable if all the non-identifiable
parameters form a set of Lebesgue measure zero.

The $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, denoted by $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$, is defined
as

$$
M_{2r}(h) := \mathbb{E}\Big[\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)^\top\Big] \in \mathbb{R}^{q^r \times q^r}. \qquad (3)
$$

² Vector $x$ is the vector generated by concatenating all vectors $x_l$, $l \in [2rn]$.

³ As an equivalent definition, if the non-zero entries of an arbitrary matrix are randomly independently perturbed
(continuous perturbation) to generate matrix A, then A is called generic.

The  following  natural  non-degeneracy  condition  is  assumed. 

Condition 1 (Non-degeneracy). The $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in
equation (3), is full rank (non-degeneracy of hidden nodes).
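As a rough numerical illustration of Condition 1 in the simplest case $r = 1$, where the moment in equation (3) is $\mathbb{E}[h h^\top]$, the sketch below compares a spread-out prior with a degenerate one. The Dirichlet prior, the sample size, and the helper name `second_moment` are assumptions made here for illustration; the empirical average stands in for the expectation.

```python
import numpy as np

def second_moment(H):
    """Empirical E[h h^T] from samples H of shape (N, q); the r = 1 case of equation (3)."""
    return H.T @ H / H.shape[0]

rng = np.random.default_rng(0)
q = 5
H_generic = rng.dirichlet(np.ones(q), size=1000)          # proportions spread over the simplex
H_degenerate = np.tile(np.full(q, 1.0 / q), (1000, 1))    # every document has identical proportions

rank_generic = np.linalg.matrix_rank(second_moment(H_generic))
rank_degenerate = np.linalg.matrix_rank(second_moment(H_degenerate))
```

In this toy setting the first moment matrix has rank $q$, while the second collapses to rank one, so the degenerate prior carries no information that separates the hidden nodes.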

Note  that  there  is  no  hope  of  distinguishing  distinct  hidden  nodes  without  this  non-degeneracy 
assumption. 

Furthermore, note that we can only hope to identify the population structure A up to scaling. This
is because the columns of A can be scaled by some arbitrary amount and the hidden variables can
also be scaled appropriately such that the observed variables do not change. Therefore, we can
identify A up to some canonical form defined as:

Definition  2  (Canonical  form).  Population  structure  A  is  said  to  be  in  canonical  form  if  all  of  its 
columns  have  unit  norm. 

3.1  Deterministic  conditions  for  generic  identifiability 

In this section, we consider a deterministic sparsity pattern on the population structure A and
establish generic identifiability, i.e., when the non-zero entries are generically identifiable. Before
providing the main result, a generalized notion of (perfect) matching for bipartite graphs is defined
and its properties are stated. We need these notions to establish identifiability.

Generalized  matching  for  bipartite  graphs 

A bipartite graph with two disjoint vertex sets $Y$ and $X$ and an edge set $E$ between them is denoted
by $G(Y, X; E)$. Given the bi-adjacency matrix $A$, the notation $G(Y, X; A)$ is also used to denote
a bipartite graph. Here, the rows and columns of matrix $A \in \mathbb{R}^{|X| \times |Y|}$ are respectively indexed
by the $X$ and $Y$ vertex sets. Furthermore, for any subset $S \subseteq Y$, the set of neighbors of vertices in
$S$ with respect to the edge matrix $A$ is defined as $N_A(S) := \{i \in X : A_{ij} \neq 0 \text{ for some } j \in S\}$.
Equivalently, it can also be defined through the corresponding edge set $E$ as
$N_E(S) := \{i \in X : (j, i) \in E \text{ for some } j \in S\}$.

Here, we define a generalized version of matching for a bipartite graph and refer to it as n-gram
matching.

Definition 3 (n-gram Matching). An n-gram matching $M$ for a bipartite graph $G(Y, X; E)$ is a
subset of edges $M \subseteq E$ for which each vertex $j \in Y$ is the end-point of at most $n$ edges in $M$ and
for any pair of vertices in $Y$ ($j_1, j_2 \in Y$, $j_1 \neq j_2$), there exists at least one non-common neighbor
in set $X$ for each of them ($j_1$ and $j_2$). More concretely, let $N_M(j)$ denote the set of neighbors of
vertex $j \in Y$ according to the edge subset $M \subseteq E$. Then, the following conditions should be satisfied
in order to call $M$ an n-gram matching. First, for any $j \in Y$, we have $|N_M(j)| \le n$. Second, for
any $j_1, j_2 \in Y$, $j_1 \neq j_2$, we have $\min\{|N_M(j_1)|, |N_M(j_2)|\} > |N_M(j_1) \cap N_M(j_2)|$.



Figure 2: A bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$ where the edge set $E$ itself is a perfect
2-gram matching.


The  perfect  n-gram  matching  is  also  defined  as  follows. 

Definition 4 (Perfect n-gram Matching). A perfect n-gram matching or Y-saturating n-gram
matching for the bipartite graph $G(Y, X; E)$ is an n-gram matching $M$ for which each vertex in $Y$ is
the end-point of exactly $n$ edges in $M$.

As an example, a bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$ is shown in Figure 2 for
which the edge set $E$ itself is a perfect 2-gram matching.

Remark 2. For the special case $n = 1$, the (perfect) n-gram matching reduces to the regular (perfect)
matching for bipartite graphs.

In the following remark, which is proved in Appendix A.3, a deterministic necessary bound is
provided on the size of a bipartite graph which has a perfect n-gram matching.

Remark 3. For a bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$ which has a perfect n-gram
matching, we necessarily have $q \le \binom{p}{n}$.

Finally, note that the existence of a perfect n-gram matching does not necessarily result in the
existence of a perfect $(n-1)$-gram matching or any other lower order matching. For example, the bipartite
graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = \binom{4}{2} = 6$, constructed as explained in the proof of the above
remark and sketched in Figure 2, has a perfect 2-gram matching, but obviously it does not have a
perfect (1-gram) matching (since $6 > 4$). But the reverse statement is true. If the degree of each
node (on matching side $Y$) is at least $n$, then the existence of a perfect $(n-1)$-gram matching results
in the existence of a perfect n-gram matching, which is easily seen from the definition.
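The two conditions of Definition 3, together with the saturation requirement of Definition 4, can be checked mechanically on small graphs. The sketch below is illustrative: the function name `is_ngram_matching` and the dictionary representation of $N_M(j)$ are choices made here, and the example graph assumes that the $|X| = 4$, $|Y| = 6$ construction assigns each vertex of $Y$ a distinct 2-element subset of $X$, which is consistent with the count $\binom{4}{2} = 6$ but is not spelled out in this section.

```python
from itertools import combinations
from math import comb

def is_ngram_matching(neighbors, n, perfect=False):
    """Check Definition 3 (and Definition 4 if perfect=True).

    neighbors : dict mapping each vertex j in Y to the set N_M(j) of its neighbors in X.
    """
    for nbrs in neighbors.values():
        # Each vertex in Y has at most n matched edges (exactly n for a perfect matching).
        if len(nbrs) > n or (perfect and len(nbrs) != n):
            return False
    for j1, j2 in combinations(neighbors, 2):
        common = neighbors[j1] & neighbors[j2]
        # min{|N_M(j1)|, |N_M(j2)|} must strictly exceed the number of common neighbors.
        if min(len(neighbors[j1]), len(neighbors[j2])) <= len(common):
            return False
    return True

X = range(4)
n = 2
# Assumed construction: one vertex of Y per n-element subset of X.
M = {j: set(subset) for j, subset in enumerate(combinations(X, n))}
size_bound = comb(len(X), n)             # the bound of Remark 3
is_perfect = is_ngram_matching(M, n, perfect=True)
```

Adding a seventh vertex to this example would force two vertices of $Y$ to share the same pair of neighbors, which violates the second condition and matches the size bound of Remark 3.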


Identifiability  conditions  based  on  existence  of  perfect  n-gram  matching  in  topic-word 
graph 

Now,  we  are  ready  to  propose  the  identifiability  conditions  and  result.  The  following  identifiability 
conditions  impose  some  combinatorial  structure  on  A. 

Condition 2 (Perfect n-gram matching). The bipartite graph $G(V_h, V_o; A)$ between hidden and
observed variables has a perfect n-gram matching.

The  above  condition  implies  that  the  sparsity  pattern  of  matrix  A  is  appropriately  scattered  for 
the  mapping  from  hidden  to  observed  variables  to  be  identifiable.  Intuitively,  it  means  that  every 
hidden  node  can  be  distinguished  from  another  hidden  node  by  its  unique  set  of  neighbors  under 
the  corresponding  n-gram  matching. 

Furthermore, condition 2 is the key to establishing identifiability in the overcomplete regime.
As stated in the size bound in Remark 3, for $n \ge 2$ the dimension of the hidden variables can exceed
the dimension of the observed variables while the graph still has a perfect n-gram matching in the
deterministic case. It is seen later in Section 3.2 that this bound (of order $q = \Theta(p^n)$) can also
be achieved in the random case. Note that this overcomplete regime is not identifiable for $n = 1$,
which is further discussed in Remark 4.

Definition 5 (Kruskal rank, [IT]). The Kruskal rank$^4$, or the krank, of a matrix $A$ is defined as the
maximum number $k$ such that every subset of $k$ columns of $A$ is linearly independent.

Condition 3 (Krank condition). The Kruskal rank of matrix $A$ satisfies the bound
$\mathrm{krank}(A) \ge d_{\max}(A)^n$, where $d_{\max}(A)$ is the maximum node degree of any column of $A$.

In the overcomplete regime, the matrix $A$ cannot have full column rank, and krank is
necessarily less than $|V_h| = q$. However, a large enough krank ensures that appropriately
sized subsets of columns of $A$ are linearly independent. For instance, when $\mathrm{krank}(A) > 1$, no
two columns are collinear. The above krank condition requires that krank is large enough
compared to the degree. Later, it is seen that the krank condition can also be satisfied under
some sufficient random combinatorial conditions.
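For small matrices, the Kruskal rank and the krank condition can be checked by brute force. The following is a minimal sketch, assuming NumPy is available and taking $d_{\max}(A)$ to be the largest number of nonzero entries in any column of $A$; the function names are illustrative. The search enumerates column subsets, so its cost grows exponentially and it is only meant for toy examples.

```python
from itertools import combinations

import numpy as np


def krank(A):
    """Largest k such that every set of k columns of A is linearly independent."""
    p, q = A.shape
    best = 0
    for k in range(1, min(p, q) + 1):
        if all(np.linalg.matrix_rank(A[:, list(cols)]) == k
               for cols in combinations(range(q), k)):
            best = k
        else:
            break  # a dependent k-subset rules out every larger k as well
    return best


def satisfies_krank_condition(A, n):
    """Check krank(A) >= d_max(A)**n, with d_max the largest column support size."""
    d_max = int(np.count_nonzero(A, axis=0).max())
    return krank(A) >= d_max ** n


# Columns 0 and 1 are collinear, so krank is 1 although the rank is 2.
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
# Identity: every column has one nonzero entry, and all columns are independent.
I4 = np.eye(4)

print(krank(B), np.linalg.matrix_rank(B))
print(krank(I4), satisfies_krank_condition(I4, n=2))
```

The first example illustrates that the matrix rank can be strictly larger than the krank.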

The main identifiability result under deterministic graph structures is stated in the following
theorem for $n \ge 2$, where $n$ is the topic persistence level. The identifiability result relies on having
the $(2rn)$-th order moment of observed variables $x_l, l \in [2rn]$, defined in equation (2) as

$$M^{(n)}_{2rn}(x) := \mathbb{E}\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\right] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$

for some integer $r \ge 1$.

Theorem 1 (Generic identifiability under deterministic topic-word graph structure). Let $M^{(n)}_{2rn}(x)$
in equation (2) be the $(2rn)$-th order observed moment of the n-persistent topic model for some
integer $r \ge 1$. If the model satisfies conditions 1, 2 and 3, then, for any $n \ge 2$, all the columns of
the population structure $A$ are generically identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order
moment of the hidden variables, denoted by $M_{2r}(h)$, is also generically identifiable.

The theorem is proved in Appendix A. It is seen that the population structure $A$ is identifiable
given any observed moment of order at least $2n$. Increasing the order of the observed moment results
in identifying higher order moments of the hidden variables.

The above theorem does not cover the case $n = 1$, which is the usual bag-of-words admixture
model. Identifiability of this model has been studied earlier [36] and we recall it below.

Remark 4 (Bag-of-words admixture model). Given $(2r)$-th order observed moments with $r \ge 1$,
the structure of the popular bag-of-words admixture model and the $(2r)$-th order moment of hidden
variables are identifiable when $A$ is full column rank and the following expansion condition holds
[36]:

$$|N_A(S)| \ge |S| + d_{\max}(A), \quad \forall S \subseteq V_h,\ |S| \ge 2. \qquad (4)$$

Our result for $n \ge 2$ in Theorem 1 provides identifiability in the overcomplete regime under the weaker
matching condition 2 and krank condition 3. The matching condition 2 is weaker than the above
expansion condition, which is based on perfect matching and hence does not allow overcomplete
models without imposing additional conditions. Furthermore, the result for the bag-of-words
admixture model requires full column rank of $A$ for identifiability, which is more stringent than our
krank condition 3.

4 Note that krank is different from the general notion of matrix rank and is a lower bound on the matrix rank,
i.e., $\mathrm{Rank}(A) \ge \mathrm{krank}(A)$.

3.2 Analysis under random topic-word graph structures

In this section, we specialize the identifiability result to the random case. This result is based on
more transparent conditions on the size and the degree of the random bipartite graph $G(V_h, V_o; A)$.
We consider the random model where, in the bipartite graph $G(V_h, V_o; A)$, each node $i \in V_h$ is
randomly connected to $d$ different nodes in the set $V_o$.

Condition 4 (Size condition). The random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| = p$, and
$A \in \mathbb{R}^{p \times q}$ satisfies the size condition $q \le (c\,p/n)^n$ for some constant $0 < c < 1$.

This size condition is required to establish that the random bipartite graph has a perfect n-gram
matching (and hence satisfies the deterministic condition 2). It is shown in Section 5.2 that the
necessary size constraint $q = O(p^n)$ proposed for the deterministic case in Remark 3 is achieved
in the random case. Thus, similar to the deterministic case, the above constraint allows for the
overcomplete regime where $q \gg p$ for $n \ge 2$.
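The random model and the size condition can be illustrated with a few lines of code. This is a sketch under stated assumptions: the size condition is taken to read $q \le (c\,p/n)^n$ with $0 < c < 1$, the neighbors of each hidden node are drawn uniformly without replacement, and the helper names and the numbers $p = 20$, $q = 50$, $d = 3$, $c = 0.9$ are purely illustrative.

```python
import random


def sample_support(p, q, d, seed=0):
    """For each of the q hidden nodes, pick d distinct observed nodes at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(p), d)) for _ in range(q)]


def satisfies_size_condition(p, q, n, c):
    """Assumed reading of condition 4: q <= (c * p / n) ** n with 0 < c < 1."""
    return 0 < c < 1 and q <= (c * p / n) ** n


p, q, d = 20, 50, 3          # overcomplete: more hidden than observed nodes
support = sample_support(p, q, d)

print(all(len(set(nbrs)) == d for nbrs in support))   # every hidden node has degree d
print(satisfies_size_condition(p, q, n=2, c=0.9))     # (0.9 * 20 / 2) ** 2 = 81 >= 50
print(satisfies_size_condition(p, q, n=1, c=0.9))     # (0.9 * 20) ** 1 = 18 < 50
```

With these numbers the overcomplete size $q = 50 > p = 20$ is admissible at persistence level $n = 2$ but not at $n = 1$.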

Condition 5 (Degree condition). In the random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| = p$,
and $A \in \mathbb{R}^{p \times q}$, the degree $d$ satisfies the following lower and upper bounds:

• Lower bound: $d \ge \max\{\alpha \log p,\ 4 + \beta \log p\}$ for some constants $\alpha > n^2/2$, $\beta > n - 1$.

•  Upper  bound: 

Intuitively, the lower bound on the degree is required to show that the corresponding bipartite
graph $G(V_h, V_o; A)$ has a sufficient number of random edges to ensure that it has a perfect n-gram
matching with high probability. The upper bound on the degree is mainly required to satisfy the
krank condition 3, where $d_{\max}(A)^n \le \mathrm{krank}(A)$.

It is important to note that, for $n \ge 2$, the above condition on the degree $d$ covers a range of models
from sparse to intermediate regimes, and it is reasonable in a number of applications that each topic
does not generate a very large number of words.

Definition 6 (whp). A sequence of events $\mathcal{E}_p$ occurs with high probability (whp) if
$\lim_{p \to \infty} \Pr(\mathcal{E}_p) = 1$.

The main random identifiability result for the model described in Section 2 is stated in the following
theorem for $n \ge 2$, while the $n = 1$ case is addressed in Remark 5. The identifiability result relies on
having the $(2rn)$-th order moment of observed variables $x_l, l \in [2rn]$, defined in equation (2)
as

$$M^{(n)}_{2rn}(x) := \mathbb{E}\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\right] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$

for some integer $r \ge 1$.

Theorem 2 (Random identifiability). Let $M^{(n)}_{2rn}(x)$ in equation (2) be the $(2rn)$-th order observed
moment of the n-persistent topic model for some integer $r \ge 1$. If the model with random
population structure $A$ satisfies conditions 1, 4 and 5, then whp, for any $n \ge 2$, all the columns of
the population structure $A$ are identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order moment of
hidden variables, denoted by $M_{2r}(h)$, is also identifiable, whp.

The theorem is proved in Appendix B. Similar to the deterministic analysis, it is seen that the
population structure $A$ is identifiable given any observed moment of order at least $2n$. Increasing
the order of the observed moment results in identifying higher order moments of the hidden variables.

The above identifiability theorem only covers $n \ge 2$; the $n = 1$ case is addressed in the
following remark.

Remark 5 (Bag-of-words admixture model). The identifiability result for the random bag-of-words
admixture model is comparable with the result in [37], which concerns exact recovery of sparsely-used
dictionaries. They assume that $Y = AX$ is given for some unknown arbitrary dictionary $A$ and
unknown random sparse coefficient matrix $X$. They establish that if the random sparse coefficient
matrix $X$ follows the Bernoulli-Subgaussian model with size constraint $p > C q \log q$ and degree
constraint $O(\log q) < \mathbb{E}[d] < O(q \log q)$, then the model is identifiable, whp. Comparing the size
and degree constraints, our identifiability result for $n \ge 2$ requires a more stringent upper bound on the
degree, but a more relaxed condition on the size, which allows identifiability in the overcomplete
regime.

3.3  Algorithm 

According to the proof of the identifiability result provided in Appendix A.1, the columns of the n-gram
matrix $A^{(n\text{-gram})}$, defined in Definition 7, are the sparsest and rank-1 (in the tensor form) vectors in
$\mathrm{Col}\big(M^{(n)}_{2n}(x)\big)$. This identifiability result can be used to recover the columns of $A$ by an exhaustive
search, which is not efficient. The more efficient algorithm provided in [36,37] for the special case
$n = 1$ can be used to recover the population structure $A$ with appropriate slight modifications. The
proposed algorithm is a convex optimization program which requires some sufficient conditions to
succeed in recovering $A$. In our setting, the proposed sufficient conditions for exact recovery need
to be imposed on $A^{(n\text{-gram})}$ instead of $A$. The main condition requires that each column of $A^{(n\text{-gram})}$
contains at least one entry that has the maximum absolute value in its row. It is then shown that,
under some additional sufficient conditions, the algorithm succeeds. See [36,37] for details.


4  Relationship  to  Tensor  Decomposition 

In  this  section,  we  first  characterize  the  moments  of  the  n-persistent  topic  model  in  terms  of  the 
model  parameters,  i.e.  the  topic-word  matrix  A  and  the  moment  of  hidden  variables.  Then,  we 
discuss  the  special  cases  of  this  model,  viz.,  the  single  topic  model  (infinite-persistent  topic  model) 
and  the  bag-of-words  admixture  model  (1-persistent  topic  model).  Then,  we  obtain  tensor  forms 
for  the  moments  of  the  topic  model  and  discuss  the  relationship  to  the  popular  CP  and  Tucker 
tensor  decompositions. 

4.1  Moment  characterization  of  the  persistent  topic  model 

The  n-gram  matrix  is  defined  as  follows. 

Definition 7 (n-gram Matrix). For any matrix $A \in \mathbb{R}^{p \times q}$, the n-gram matrix $A^{(n\text{-gram})} \in \mathbb{R}^{p^n \times q}$ is
defined as the matrix whose $((i_1, \dots, i_n), j)$-th entry is given by

$$A^{(n\text{-gram})}((i_1, \dots, i_n), j) := A_{i_1 j} A_{i_2 j} \cdots A_{i_n j}$$

for all $(i_1, \dots, i_n) \in [p]^n$ and $j \in [q]$.

[Figure 3: Hierarchical structure of the single topic model and the bag-of-words admixture model, shown for $2m$
words (views). (a) Single topic model (infinite-persistent topic model). (b) Bag-of-words admixture model
(1-persistent topic model).]

That is, $A^{(n\text{-gram})}$ is the column-wise n-th order Kronecker product (or the Khatri-Rao product) [34]
of $n$ copies of $A$.
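Definition 7 translates directly into code. The following sketch, assuming NumPy and a row ordering in which the last index $i_n$ varies fastest, builds the n-gram matrix by repeated column-wise Kronecker products; the function name `ngram_matrix` is illustrative.

```python
import numpy as np


def ngram_matrix(A, n):
    """Column-wise n-fold Kronecker (Khatri-Rao) power of A, shape (p**n, q)."""
    p, q = A.shape
    out = A
    for _ in range(n - 1):
        # Row (i, k) of the new matrix holds out[i, j] * A[k, j] for column j.
        out = np.einsum('ij,kj->ikj', out, A).reshape(-1, q)
    return out


A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # p = 3, q = 2
G = ngram_matrix(A, 3)          # shape (27, 2)

# Entry ((i1, i2, i3), j) sits in row i1 * p**2 + i2 * p + i3 (0-based indices).
i1, i2, i3, j = 2, 0, 1, 1
row = i1 * 9 + i2 * 3 + i3
print(G.shape)
print(np.isclose(G[row, j], A[i1, j] * A[i2, j] * A[i3, j]))
```

The number of columns stays at $q$ while the number of rows grows to $p^n$, which is the expansion property the later sections rely on.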

We  now  characterize  the  observed  moments  of  a  persistent  topic  model.  Throughout  this  section, 
the  number  of  observed  variables  is  fixed  to  2m. 

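Lemma 1 (n-persistent topic model moment characterization). The $(2m)$-th order moment of
observed variables, defined in equation (2), for the n-persistent topic model is characterized as:

• If $m = rn$ for some integer $r \ge 1$, then

$$M^{(n)}_{2m}(x) = \Big(\underbrace{A^{(n\text{-gram})} \otimes \cdots \otimes A^{(n\text{-gram})}}_{r \text{ times}}\Big)\, M_{2r}(h)\, \Big(\underbrace{A^{(n\text{-gram})} \otimes \cdots \otimes A^{(n\text{-gram})}}_{r \text{ times}}\Big)^\top, \qquad (5)$$

where $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ is the $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in
equation (3).

• If $n \ge 2m$, then

$$M^{(n)}_{2m}(x) = \Big(\underbrace{A \odot \cdots \odot A}_{m \text{ times}}\Big)\, M_1(h)\, \Big(\underbrace{A \odot \cdots \odot A}_{m \text{ times}}\Big)^\top, \qquad (6)$$

where $M_1(h) := \mathrm{Diag}(\mathbb{E}[h]) \in \mathbb{R}^{q \times q}$ is the first order moment of hidden variables $h \in \mathbb{R}^q$,
stacked in a diagonal matrix.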


Comparison  with  single  topic  model  and  bag-of-words  admixture  model 

In this section, the proposed n-persistent topic model in Section 2.1 is compared with the single
topic model ($n \to \infty$) in Figure 3a and the bag-of-words admixture model ($n = 1$) in Figure 3b. In
order to have a fair comparison, the number of observed variables is fixed to $2m$ and the persistence
level is varied.

Single topic model ($n \to \infty$): The moment of the single topic model, where $n \to \infty$, is characterized
by equation (6). As expected, this moment form is more “structured” than the moment of the
n-persistent topic model in equation (5). Note that the moment of hidden variables involved in the
single topic model is diagonal. Moreover, the observed moment of the single topic model only
involves Khatri-Rao products of the population structure $A$, while the observed moment of the
n-persistent topic model also involves Kronecker products of the n-gram matrix $A^{(n\text{-gram})}$. Therefore,
the n-persistent topic model is more general than the single topic model and, importantly, it remains
identifiable in the overcomplete regime.

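Bag-of-words admixture model ($n = 1$): From Lemma 1, the $(2m)$-th order moment of
observed variables $x_l, l \in [2m]$, for the bag-of-words admixture model (1-persistent topic model)
shown in Figure 3b is given by

$$M^{(1)}_{2m}(x) = \Big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\Big)\, M_{2m}(h)\, \Big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\Big)^\top, \qquad (7)$$

where $M_{2m}(h) \in \mathbb{R}^{q^m \times q^m}$ is the $(2m)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in (3).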

Why does persistence help in identifiability of overcomplete models? Comparing equations
(7) and (5), it is seen that the n-persistent topic model has a more succinct representation of
the $(2m)$-th order moment of the observed variables, which is crucial for providing identifiability
in the overcomplete regime. More Kronecker products are involved in the bag-of-words
admixture model than in the n-persistent topic model.

We now give a simple example to illustrate how persistence helps in providing identifiability in
the overcomplete regime. Consider the instances $r = 1, n = 2$, a 2-persistent topic model, and
$r = 2, n = 1$, a bag-of-words admixture model. From equations (5) and (7), the moments of these
instances are respectively characterized as

$$M^{(2)}_4(x) = (A \odot A)\, \mathbb{E}[h h^\top]\, (A \odot A)^\top,$$

$$M^{(1)}_4(x) = (A \otimes A)\, \mathbb{E}[(h \otimes h)(h \otimes h)^\top]\, (A \otimes A)^\top.$$


In the 2-persistent model, by the Khatri-Rao product, the number of columns of the resulting matrix
$A \odot A \in \mathbb{R}^{p^2 \times q}$ is the same as the number of columns of the original matrix $A$, while the number
of rows is increased. The columns of $A \odot A$ are indexed by the first order hidden variables and
the rows are indexed by the second order observed variables. Therefore, the Khatri-Rao product
expands the effect of hidden variables to higher order observed variables. In general, it does so by
retaining the order of the involved hidden variables (retaining the number of columns in the resulting
matrix $A^{(n\text{-gram})}$) while increasing the order of the involved observed variables (increasing the number
of rows in the resulting matrix $A^{(n\text{-gram})}$). This kind of expansion on the higher order observed
variables in the persistent models is the key which helps to identify the model in the overcomplete
regime. In other words, the original overcomplete representation becomes determined by expanding
the effect of (retained order) hidden variables to higher order observed variables.

On the other hand, in the bag-of-words admixture model, this expansion property does
not hold, since the Kronecker product $A \otimes A \in \mathbb{R}^{p^2 \times q^2}$ is involved. The Kronecker product
increases both the order of the involved hidden variables and observed variables by the same amount.
Therefore, for the regular admixture model (with persistence level 1), it is not possible to identify
its population structure $A$ in the overcomplete regime.
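The contrast between the two products is easy to see numerically. The sketch below, assuming NumPy, uses a small hand-picked overcomplete matrix ($p = 2$, $q = 3$); the matrix itself is an illustrative choice and not an example from the paper.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])              # p = 2 < q = 3 (overcomplete)
p, q = A.shape

# Khatri-Rao product: Kronecker product taken column by column.
khatri_rao = np.einsum('ij,kj->ikj', A, A).reshape(p * p, q)
# Kronecker product: all pairs of columns are combined.
kronecker = np.kron(A, A)

print(khatri_rao.shape, np.linalg.matrix_rank(khatri_rao))
print(kronecker.shape, np.linalg.matrix_rank(kronecker))
```

Here the Khatri-Rao product is a $4 \times 3$ matrix with full column rank 3, even though $A$ itself cannot have full column rank. The Kronecker product is $4 \times 9$, so its rank is at most 4 and its 9 columns are necessarily dependent.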


The above discussion can also be generalized to the general case of moments in equations (5) and (7).




4.2  Tensor  representation  of  the  model 

In this section, we derive the tensor algebra of the moments derived in Section 4.1 for the
persistent topic model. We compare the tensor form with the well-known Tucker and CP
decompositions.


Tensor  algebra  preliminaries 

A real-valued order-n tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i} := \mathbb{R}^{p_1 \times \cdots \times p_n}$ is an n-dimensional array
$A(1{:}p_1, \dots, 1{:}p_n)$, where the i-th mode is indexed from 1 to $p_i$. In this paper, we restrict ourselves
to the case $p_1 = \cdots = p_n = p$, and simply write $A \in \bigotimes^n \mathbb{R}^p$. A fiber of a tensor $A$ is a vector obtained by
fixing all indices of $A$ except one, e.g., for $A \in \bigotimes^4 \mathbb{R}^3$, the vector $f = A(2, 1{:}3, 3, 1)$ is a fiber.

The tensor $A \in \bigotimes^n \mathbb{R}^p$ is stacked in a vector $a \in \mathbb{R}^{p^n}$ by the $\mathrm{vec}(\cdot)$ operator defined as

$$a = \mathrm{vec}(A) \iff a\big((i_1 - 1)p^{n-1} + (i_2 - 1)p^{n-2} + \cdots + (i_{n-1} - 1)p + i_n\big) = A(i_1, i_2, \dots, i_n).$$

The inverse of the $a = \mathrm{vec}(A)$ operation is denoted by $A = \mathrm{ten}(a)$.

For vectors $a_i \in \mathbb{R}^{p_i}, i \in [n]$, the tensor outer product operator “$\circ$” is defined as [34]

$$A = a_1 \circ a_2 \circ \cdots \circ a_n \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i} \iff A(i_1, i_2, \dots, i_n) := a_1(i_1)\, a_2(i_2) \cdots a_n(i_n). \qquad (8)$$

The above generated tensor is a rank-1 tensor. The tensor rank is the minimal number of rank-1
tensors into which a tensor can be decomposed$^5$.

In general, the outer product operation is a way to combine lower order tensors to construct higher
order tensors, e.g., for $B \in \mathbb{R}^{p_1 \times p_2}$, $C \in \mathbb{R}^{p_3 \times p_4}$, the 4-th order tensor $A = B \circ C \in \bigotimes_{i=1}^{4} \mathbb{R}^{p_i}$ is
defined as $A(i_1, i_2, i_3, i_4) := B(i_1, i_2)\, C(i_3, i_4)$.

According to the above definitions, for any set of vectors $a_i \in \mathbb{R}^{p_i}, i \in [n]$, we have the following pair
of equalities:

$$\mathrm{vec}(a_1 \circ a_2 \circ \cdots \circ a_n) = a_1 \otimes a_2 \otimes \cdots \otimes a_n,$$
$$\mathrm{ten}(a_1 \otimes a_2 \otimes \cdots \otimes a_n) = a_1 \circ a_2 \circ \cdots \circ a_n.$$
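The pair of equalities between vec, ten, the outer product and the Kronecker product can be verified numerically. The sketch below assumes NumPy and uses 0-based indexing; the stacking order in which the last index varies fastest coincides with NumPy's default row-major reshape. The vectors are arbitrary illustrative values.

```python
import numpy as np

a1 = np.array([1.0, 2.0])
a2 = np.array([3.0, 4.0, 5.0])
a3 = np.array([6.0, 7.0])

# Outer product: T[i1, i2, i3] = a1[i1] * a2[i2] * a3[i3].
T = np.einsum('i,j,k->ijk', a1, a2, a3)

# vec(.) with the last index varying fastest is the default (C-order) flattening.
vec_T = T.reshape(-1)
kron = np.kron(np.kron(a1, a2), a3)

# ten(.) undoes vec(.) once the mode sizes are known.
ten_kron = kron.reshape(T.shape)

print(np.allclose(vec_T, kron))      # vec(a1 o a2 o a3) equals the Kronecker product
print(np.allclose(ten_kron, T))      # ten of the Kronecker product equals the outer product
```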


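For any vector $a \in \mathbb{R}^p$, the power notations are also defined as

$$a^{\otimes n} := \underbrace{a \otimes a \otimes \cdots \otimes a}_{n \text{ times}} \in \mathbb{R}^{p^n},$$

$$a^{\circ n} := \underbrace{a \circ a \circ \cdots \circ a}_{n \text{ times}} \in \bigotimes^{n} \mathbb{R}^p.$$

The second power is usually called the n-th order tensor power of the vector $a$.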

Finally, the CP (CANDECOMP/PARAFAC) and Tucker representations and the Kruskal form
notation are defined as follows [34].

5 This type of rank is called CP (CANDECOMP/PARAFAC) tensor rank in the literature [34].

Definition 8 (CP representation and Kruskal form). Given $\lambda \in \mathbb{R}^r$ and $U_i \in \mathbb{R}^{p_i \times r}, i \in [n]$, the n-th
order tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i}$ is defined in the Kruskal form as

$$A = [[\lambda;\ U_1, U_2, \dots, U_n]] := \sum_{i=1}^{r} \lambda_i\, U_1(:, i) \circ U_2(:, i) \circ \cdots \circ U_n(:, i), \qquad (9)$$

where $U_j(:, i)$ denotes the i-th column of matrix $U_j$. The above representation of tensor $A$ is called
the CP representation (decomposition), where the tensor $A$ is written as a weighted sum of rank-1
tensors.

More generally, the Tucker representation is defined as follows.

Definition 9 (Tucker representation). Given a core tensor $S \in \bigotimes_{i=1}^{n} \mathbb{R}^{r_i}$ and inverse factors
$U_i \in \mathbb{R}^{p_i \times r_i}, i \in [n]$, the Tucker representation of the n-th order tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i}$ is

$$A = \sum_{i_1=1}^{r_1} \sum_{i_2=1}^{r_2} \cdots \sum_{i_n=1}^{r_n} S(i_1, i_2, \dots, i_n)\, U_1(:, i_1) \circ U_2(:, i_2) \circ \cdots \circ U_n(:, i_n), \qquad (10)$$

where $U_j(:, i_j)$ denotes the $i_j$-th column of matrix $U_j$. With a slight abuse of notation, the above
Tucker representation can also be denoted in the form $A = [[S;\ U_1, U_2, \dots, U_n]]$.

Note that the CP representation is a special case of the Tucker representation when the core tensor
$S$ is square and diagonal.
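The relation between the two representations can be made concrete for an order-3 tensor. The following sketch, assuming NumPy, builds a CP tensor from weights and factor matrices and then rebuilds it as a Tucker tensor whose core is zero except on the superdiagonal; the sizes and random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
r, dims = 3, (4, 5, 6)
lam = rng.standard_normal(r)
U1, U2, U3 = (rng.standard_normal((p_i, r)) for p_i in dims)

# CP / Kruskal form: weighted sum of r rank-1 tensors.
cp = np.einsum('a,ia,ja,ka->ijk', lam, U1, U2, U3)

# Tucker form with an r x r x r core that is zero off the superdiagonal.
core = np.zeros((r, r, r))
core[np.arange(r), np.arange(r), np.arange(r)] = lam
tucker = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

print(np.allclose(cp, tucker))
```

A general Tucker core has $r_1 r_2 r_3$ free entries, whereas the CP special case keeps only the $r$ superdiagonal ones.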


Tensor  representation  of  moments  under  topic  model 

The $(2m)$-th order moment of the words $x_l, l \in [2m]$, is defined as (in the tensor form)

$$T_{2m}(x)(i_1, i_2, \dots, i_{2m}) := \mathbb{E}[x_1(i_1)\, x_2(i_2) \cdots x_{2m}(i_{2m})], \quad i_1, i_2, \dots, i_{2m} \in [p], \qquad (11)$$

where $T_{2m}(x) \in \bigotimes^{2m} \mathbb{R}^p$. For the n-persistent topic model with $2m$ observations (words)
$x_l, l \in [2m]$, the corresponding moment is denoted by $T^{(n)}_{2m}(x)$, which is the tensor form of the moment
$M^{(n)}_{2m}(x)$ characterized in Lemma 1. This tensor is characterized in the following lemma, proved in
Appendix A.2.

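Lemma 2 (n-persistent topic model moment characterization in tensor form). The $(2m)$-th order
moment of words, defined in equation (11), for the n-persistent topic model is characterized as:

• If $m = rn$ for some integer $r \ge 1$, then

$$T^{(n)}_{2m}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2r}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2r}}]\; a_{i_1}^{\circ n} \circ a_{i_2}^{\circ n} \circ \cdots \circ a_{i_{2r}}^{\circ n}. \qquad (12)$$

• If $n \ge 2m$, then

$$T^{(n)}_{2m}(x) = \sum_{i \in [q]} \mathbb{E}[h_i]\; a_i^{\circ 2m}. \qquad (13)$$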

The tensor representation (12) is a specific type of tensor decomposition which is a special case of
the Tucker representation, but more general than the CP representation.

Comparison with single topic models and bag-of-words admixture model

The tensor representation of our model provided in equation (12) is a special case of the Tucker
representation but more general than the symmetric CP representation. In order to have a fair
comparison, the number of observed variables is fixed to $2m$ and the persistence level is varied.

CP representation of the single topic model: The $(2m)$-th order moment of the words for
the single topic model (infinite-persistent topic model) is provided in equation (13) as

$$T^{(\infty)}_{2m}(x) = \sum_{i \in [q]} \mathbb{E}[h_i]\, a_i^{\circ 2m} = [[\mathbb{E}[h];\ \underbrace{A, A, \dots, A}_{2m \text{ times}}]],$$

where the Kruskal notation, defined in Definition 8, is used in the last equality. This representation
is exactly the symmetric CP representation (decomposition) of $T^{(\infty)}_{2m}(x)$, where $\lambda_i = \mathbb{E}[h_i], i \in [q]$,
and $U_i = A, i \in [2m]$.
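The tensor and matrix forms of the single topic moment describe the same object, which can be confirmed on a toy instance. The sketch below assumes NumPy, takes $m = 2$, reads the matrix form as $(A \odot A)\,\mathrm{Diag}(\mathbb{E}[h])\,(A \odot A)^\top$ and the tensor form as $\sum_i \mathbb{E}[h_i]\, a_i^{\circ 4}$, and uses random values as stand-ins for $A$ and $\mathbb{E}[h]$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, m = 3, 4, 2
A = rng.standard_normal((p, q))
w = rng.random(q)                     # stands in for E[h_i]

# Tensor form: sum_i w_i * a_i o a_i o a_i o a_i  (order 2m = 4).
T = np.einsum('i,ai,bi,ci,di->abcd', w, A, A, A, A)

# Matrix form: (A kr A) Diag(w) (A kr A)^T, with kr the Khatri-Rao product.
kr = np.einsum('ai,bi->abi', A, A).reshape(p * p, q)
M = kr @ np.diag(w) @ kr.T

# Grouping the first m and the last m modes of T gives M.
print(np.allclose(T.reshape(p ** m, p ** m), M))
```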

Tucker representation of the bag-of-words admixture model: From Lemma 2, the tensor
form of the $(2m)$-th order moment of observed variables $x_l, l \in [2m]$, for the bag-of-words admixture
model (1-persistent topic model) is given by

$$T^{(1)}_{2m}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2m}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2m}}]\; a_{i_1} \circ a_{i_2} \circ \cdots \circ a_{i_{2m}} = [[\mathbb{E}[h^{\circ (2m)}];\ \underbrace{A, A, \dots, A}_{2m \text{ times}}]],$$

where the notation defined in Definition 9 is used in the last equality. This representation
is exactly the Tucker representation (decomposition) of $T^{(1)}_{2m}(x)$, where the core tensor $S = \mathbb{E}[h^{\circ (2m)}]$ is
the tensor form of the $(2m)$-th order hidden moment $M_{2m}(h)$, defined in equation (3). Furthermore,
the inverse factors $U_i = A, i \in [2m]$, correspond to the population structure $A$.

Along the lines of the discussion in Section 4.1, the above general Tucker decomposition is not identifiable in the
overcomplete regime, while the proposed tensor decomposition in equation (12) is identifiable under
the sufficient conditions provided in Section 3.


5  Proof  Techniques  and  Auxiliary  Results 

The main identifiability results are provided for both deterministic and random cases of the
topic-word graph structure, in Sections 3.1 and 3.2 respectively. In this section, we first provide the
proof sketch of these results, and then we propose two auxiliary results on the existence of a perfect
n-gram matching for random bipartite graphs and a lower bound on the Kruskal rank of random
matrices.

5.1 Proof sketch


The deterministic analysis is primarily based on conditions on the n-gram matrix $A^{(n\text{-gram})}$; but
since these conditions (mainly the expansion condition on $A^{(n\text{-gram})}$, provided in condition 7) are
opaque, this analysis is postponed to Appendix A.1, where the identifiability result is stated in
Theorem 6. In the following, we first provide a summary of the hierarchical relationships among
all of these identifiability results and the corresponding conditions. Then, according to this
hierarchy, a proof sketch of each result is stated.

Summary of relationships among different conditions: To summarize, there exists a
hierarchy among the proposed conditions as follows. First, in the random analysis, the size and the
degree conditions 4 and 5 are sufficient for satisfying the perfect n-gram matching and the krank
conditions 2 and 3, shown by Theorems 4 and 5. Then, these conditions 2 and 3 ensure that the
rank and the expansion conditions 6 and 7 hold, shown by Lemma 5. Finally, these conditions
6 and 7, together with the non-degeneracy condition 1, yield the primary identifiability result in
Theorem 6. Note that the genericity of $A$ is also required for these results to hold.

Primary deterministic analysis in Theorem 6: The deterministic analysis in Theorem 6
is described here for the case when $2n$ words are available under the n-persistent
topic model. From equation (5), the $(2n)$-th order moment of the observed variables under the
n-persistent topic model can be written as

$$M^{(n)}_{2n}(x) = A^{(n\text{-gram})}\, \mathbb{E}[h h^\top]\, \big(A^{(n\text{-gram})}\big)^\top. \qquad (14)$$

The question is whether we can recover $A$, given $M^{(n)}_{2n}(x)$. Obviously, the matrix $A$ is not
identifiable without any further conditions. First, non-degeneracy and rank conditions (conditions
1 and 6) are required. Without such non-degeneracy assumptions, there is no hope for identifiability.
Assuming these two conditions, we have from (14) that

$$\mathrm{Col}\big(M^{(n)}_{2n}(x)\big) = \mathrm{Col}\big(A^{(n\text{-gram})}\big).$$

Therefore, the problem of recovering $A$ from $M^{(n)}_{2n}(x)$ reduces to finding $A^{(n\text{-gram})}$ in $\mathrm{Col}(A^{(n\text{-gram})})$.
Then, it is shown that under the following expansion condition on $A^{(n\text{-gram})}$ and the genericity
property, the matrix $A$ is identifiable from $\mathrm{Col}(A^{(n\text{-gram})})$. The expansion condition (refer to
condition 7 for a more detailed statement) imposes the following property on the bipartite graph
$G(V_h, V_o^{(n)}; A^{(n\text{-gram})})$$^6$:

$$\big|N_{A^{(n\text{-gram})}_{\mathrm{Rest.}}}(S)\big| \ge |S| + d_{\max}\big(A^{(n\text{-gram})}\big), \quad \forall S \subseteq V_h,\ |S| > \mathrm{krank}(A), \qquad (15)$$

where $d_{\max}(A^{(n\text{-gram})})$ is the maximum node degree in the set $V_h$, and the restricted version of the n-gram
matrix, denoted by $A^{(n\text{-gram})}_{\mathrm{Rest.}}$, is defined in Definition 10. The identifiability claim is proved by
showing that the columns of $A^{(n\text{-gram})}$ are the sparsest and rank-1 (in the tensor form) vectors in
$\mathrm{Col}(A^{(n\text{-gram})})$ under the sufficient expansion condition in (15) and the genericity conditions. This finishes the
proof sketch for the deterministic identifiability result based on $A^{(n\text{-gram})}$, proposed in Theorem 6.
Note that the expansion condition (15) is more relaxed than the expansion
condition proposed in [36,37] for identifiability in the undercomplete regime. For a more detailed
comparison, refer to Remark 8 in Appendix A.1.

6 $V_o^{(n)}$ denotes all ordered n-tuples generated from the set $V_o := \{1, \dots, p\}$, which indexes the rows of $A^{(n\text{-gram})}$.
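The column-space identity used in this proof sketch, $\mathrm{Col}(M^{(n)}_{2n}(x)) = \mathrm{Col}(A^{(n\text{-gram})})$, can be checked on a toy instance with $n = 2$. The sketch below assumes NumPy; the matrix $A$ and the positive definite stand-in for $\mathbb{E}[h h^\top]$ are hand-picked illustrative values chosen so that the rank and non-degeneracy requirements hold.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])                       # p = 2, q = 3
p, q = A.shape
B = np.einsum('ij,kj->ikj', A, A).reshape(p * p, q)   # 2-gram matrix, full column rank here

H = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])                       # stand-in for E[h h^T], full rank
M = B @ H @ B.T                                       # moment of the form B H B^T

rank = np.linalg.matrix_rank
# Equal column spaces: stacking the two matrices side by side adds no new directions.
print(rank(M), rank(B), rank(np.hstack([M, B])))
```

All three ranks coincide, so the columns of the 2-gram matrix lie in the column space of the moment matrix and span it.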

Deterministic analysis in Theorem 1: The expansion and rank conditions in Theorem 6 are
imposed on the n-gram matrix $A^{(n\text{-gram})}$. According to the generalized matching notions defined
in Section 3.1, sufficient combinatorial conditions on matrix $A$ (conditions 2 and 3) are introduced
which ensure that the expansion and rank conditions on $A^{(n\text{-gram})}$ are satisfied. This is shown in
Lemma 5 using the observation in the following lemma.

In the following lemma, which is proved in Appendix A.3, we state an interesting property which
relates the existence of a perfect matching in $A^{(n\text{-gram})}$ to the existence of a perfect n-gram matching
in $A$.

Lemma 3. If $G(Y, X; A)$ has a perfect n-gram matching, then $G(Y, X^{(n)}; A^{(n\text{-gram})})$ has a perfect
matching. In the other direction, if $G(Y, X^{(n)}; A^{(n\text{-gram})})$ has a perfect matching $M^{(n\text{-gram})}$, then
$G(Y, X; A)$ has a perfect n-gram matching under the following condition on $M^{(n\text{-gram})}$: all the
matching edges $(j, (i_1, \dots, i_n)) \in M^{(n\text{-gram})}$ should satisfy $i_1 \ne i_2 \ne \cdots \ne i_n$ for all $j \in Y$. In
words, the matching edges should be connected to nodes in $X^{(n)}$ which are indexed by tuples of
distinct indices.

Using this lemma, condition 2 implies that $G(Y, X^{(n)}; A^{(n\text{-gram})})$ has a perfect matching. Then, it is
straightforward to argue that the expansion and rank conditions on $A^{(n\text{-gram})}$ are satisfied, which
is shown in Lemma 5 in Appendix A.4. This leads to the generic identifiability result stated in
Theorem 1.

Random analysis in Theorem 2: Finally, the identifiability result for a random matrix $A$ is
provided in Theorem 2 in Section 3.2. Sufficient size and degree conditions 4 and 5 on the random
matrix $A$ are proposed such that the deterministic combinatorial conditions 2 and 3 on $A$ are
satisfied. The details of these auxiliary results are provided in the following two sections.
In Section 5.2, it is proved in Theorem 4 that a random bipartite graph satisfying reasonable size
and degree constraints has a perfect n-gram matching (condition 2), whp. Then, a lower bound on
the Kruskal rank of a random matrix $A$ under size and degree constraints is provided in Theorem
5 in Section 5.3, which helps to satisfy the krank condition 3. Intuitions on why such size and degree
conditions are required are given in Section 3.2, where these conditions are proposed.


5.2 Existence of perfect n-gram matching for random bipartite graphs

The result of this section is used in the proof of Theorem 2, but since the result is interesting
and useful by itself, we also state it independently. In this section, it is shown that a random
bipartite graph satisfying reasonable size and degree constraints, proposed earlier in conditions 4
and 5, has a perfect n-gram matching whp.

In the proof of the necessary size condition for the existence of perfect n-gram matching proposed
in Remark 3, we provide an analysis which is also constructive, i.e., we provide a deterministic
greedy method to construct a bipartite graph which has a perfect n-gram matching when satisfying
$q \le \binom{p}{n}$. Now, the question is under what conditions a random bipartite graph has a perfect n-gram
matching. In this section, this question is answered in Theorem 4, where it is seen that the size bound
$q = O(p^n)$ is also sufficient for the existence of perfect n-gram matching in a random bipartite
graph.

Before proposing our result on the existence of perfect n-gram matching in random bipartite graphs,
the existing results on the existence of perfect matching in random bipartite graphs are reviewed
[38-40]. Here, we recap the result of [39], which is used to prove the existence of perfect n-gram
matching in random bipartite graphs. Let $z^*$ and $c^*$ satisfy [39]


$$\frac{z^*\,(e^{z^*} - 1)}{e^{z^*} - 1 - z^*} = d, \qquad c^* = \frac{z^*}{d\,(1 - e^{-z^*})^{d-1}}, \tag{16}$$


where each node $i \in Y$ in the random bipartite graph $G(Y, X; E)$ is randomly connected to d
different nodes in set X.

Theorem 3 (Existence of perfect matching for random bipartite graphs, [39]). Consider a random
bipartite graph $G(Y, X; E)$ with node size ratio $c := |Y|/|X|$ and $d \ge 3$. If $c < c^*$, then whp, there exists
a perfect matching in the random bipartite graph $G(Y, X; E)$.

Theorem 4 (Existence of perfect n-gram matching for random bipartite graphs). Consider a
random bipartite graph $G(Y, X; E)$ with $|Y| = q$ nodes on the left side and $|X| = p$ nodes on the
right side. Assume that it satisfies the size condition $q \le \left(c\,\frac{p}{n}\right)^n$ (condition 4) for some constant
$0 < c < 1$ and the degree condition (degree of nodes in Y) $d \ge \alpha \log p$ for some $\alpha > n^2/2$ (lower
bound in condition 5). Then, whp, there exists a perfect (Y-saturating) n-gram matching in the
bipartite graph $G(Y, X; E)$.

Remark 6 (Necessity of the proposed size bound). It is crucial to see that the size bound $q = O(p^n)$
in the proposed random result for the existence of perfect n-gram matching achieves the necessary
size bound $q \le \binom{p}{n} = O(p^n)$, proposed in Remark 3.

Remark 7 (Insufficiency of the union bound argument). It is easier to exploit union bound
arguments to propose random bipartite graphs which have a perfect n-gram matching whp. It is
proved in Appendix B.1 that if $d \ge n$ and the size constraint $|Y| = O(|X|^{\frac{n}{2} - \delta})$ for some $\delta > 0$ is
satisfied, then whp, the random bipartite graph has a perfect n-gram matching. Comparing this
result with ours in Theorem 4, the latter has a better size scaling while the former has a better
degree scaling. The size scaling limitation in the union bound argument makes it unattractive. In
order to identify the population structure A in the overcomplete regime where $|Y| = O(|X|^n)$, we
need at least the (4n)-th order moment according to the union bound arguments, while only the
(2n)-th order moment is required according to our more involved arguments.


5.3  Lower  bound  on  the  Kruskal  rank  of  random  matrices 

The result of this section is used in the proof of Theorem 2. In the following theorem, a lower bound
on the Kruskal rank of a random matrix A under dimension and degree constraints is provided,
which is proved in Appendix B.1.

Theorem 5 (Lower bound on the Kruskal rank of random matrices). Consider a random matrix
$A \in \mathbb{R}^{p \times q}$ for which there exist d (which is called the degree) random non-zero entries in
each column. Assume that it satisfies the size condition $q \le \left(c\,\frac{p}{n}\right)^n$ (condition 4) and the degree condition
$d \ge 4 + \beta \log p$ for some $\beta > n - 1$ (lower bound in condition 5) and in addition A is generic. Then,
whp, $\operatorname{krank}(A) \ge cp$.
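
On small instances, the Kruskal rank appearing in Theorem 5 can be checked by brute force. The following is a minimal sketch, assuming the standard definition (krank(A) is the largest k such that every set of k columns of A is linearly independent). The helper names `kruskal_rank` and `random_sparse_matrix`, the Gaussian choice for the non-zero entries, and the numerical tolerance are illustrative assumptions and are not part of the analysis. The search is exponential in q, so it is only meant for sanity checks.

```python
import itertools

import numpy as np


def kruskal_rank(A, tol=1e-10):
    """Largest k such that every set of k columns of A is linearly independent."""
    p, q = A.shape
    krank = 0
    for size in range(1, min(p, q) + 1):
        for cols in itertools.combinations(range(q), size):
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < size:
                return krank  # found a dependent subset of this size
        krank = size
    return krank


def random_sparse_matrix(p, q, d, rng):
    """p x q matrix with d non-zero entries at random rows of each column.

    Gaussian values stand in for "generic" entries (an assumption of this sketch).
    """
    A = np.zeros((p, q))
    for j in range(q):
        rows = rng.choice(p, size=d, replace=False)
        A[rows, j] = rng.standard_normal(d)
    return A


rng = np.random.default_rng(0)
A = random_sparse_matrix(p=6, q=8, d=3, rng=rng)
print("krank(A) =", kruskal_rank(A))
```

The early return relies on the fact that a dependent subset of size k makes every larger superset dependent as well, so the first failing size determines the Kruskal rank.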

Acknowledgements 

The authors acknowledge useful discussions with Sina Jafarpour, Moses Charikar, and Kamalika
Chaudhuri. A. Anandkumar is supported in part by NSF Career award CCF-1254106, NSF Award
CCF-1219234, AFOSR Award FA9550-10-1-0310, and ARO Award W911NF-12-1-0404. M. Janzamin
is supported by NSF Award CCF-1219234 and ARO Award W911NF-12-1-0404.


Appendix 

A  Proof  of  Deterministic  Identifiability  Result  (Theorem  1) 

First, we show the identifiability result under an alternative set of conditions on the n-gram matrix,
$A^{(n\text{-gram})}$; and then, we show that the conditions of Theorem 1 are sufficient for this alternative
result.


A.1  Deterministic analysis based on $A^{(n\text{-gram})}$


In this section, the deterministic identifiability result based on conditions on the n-gram matrix,
$A^{(n\text{-gram})}$, is provided.

In the n-gram matrix, $A^{(n\text{-gram})} \in \mathbb{R}^{p^n \times q}$, redundant rows exist. Precisely, if some row of $A^{(n\text{-gram})}$
is indexed by the n-tuple $(i_1, \dots, i_n)$, $i_k \in [p]$, then another row indexed by any permutation of the tuple
$(i_1, \dots, i_n)$ has exactly the same entries. In other words, since multiplication is commutative, the
number of distinct rows of $A^{(n\text{-gram})}$ is at most the number of (potentially) different products
$A_{i_1,j} A_{i_2,j} \cdots A_{i_n,j}$ in a column $j \in [q]$. Therefore, the number of distinct rows of $A^{(n\text{-gram})}$ is at
most $\binom{p+n-1}{n}$. In the following definition, we define a non-redundant version of the n-gram matrix
which is restricted to the (potentially) distinct rows.

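Definition 10 (Restricted n-gram matrix). For any matrix $A \in \mathbb{R}^{p \times q}$, the restricted n-gram matrix
$A^{(n\text{-gram})}_{\text{Rest.}} \in \mathbb{R}^{s \times q}$, $s = \binom{p+n-1}{n}$, is defined as the restricted version of the n-gram matrix
$A^{(n\text{-gram})} \in \mathbb{R}^{p^n \times q}$, where the redundant rows of $A^{(n\text{-gram})}$ are removed, as explained above.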
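
The construction in Definition 10 can be illustrated numerically. The sketch below builds the n-gram matrix entry-wise as the products $A_{i_1,j} \cdots A_{i_n,j}$ described above and keeps one row per multiset of indices. The function names `ngram_matrix` and `restricted_ngram_matrix` and the row ordering (sorted index tuples) are choices made for this illustration only.

```python
import itertools
from math import comb

import numpy as np


def ngram_matrix(A, n):
    """p^n x q matrix whose column j is the n-fold Kronecker power of column j of A."""
    cols = []
    for j in range(A.shape[1]):
        col = A[:, j]
        for _ in range(n - 1):
            col = np.kron(col, A[:, j])
        cols.append(col)
    return np.stack(cols, axis=1)


def restricted_ngram_matrix(A, n):
    """Keep one row per multiset of indices i_1 <= ... <= i_n."""
    p = A.shape[0]
    index_sets = itertools.combinations_with_replacement(range(p), n)
    # Row for (i_1, ..., i_n): entry-wise product of rows i_1, ..., i_n of A.
    return np.array([np.prod(A[list(idx), :], axis=0) for idx in index_sets])


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
full = ngram_matrix(A, 3)
rest = restricted_ngram_matrix(A, 3)
print(full.shape, rest.shape, comb(4 + 3 - 1, 3))
```

For p = 4 and n = 3 the full matrix has $4^3 = 64$ rows while the restricted one has $\binom{6}{3} = 20$, which is the count stated in the definition.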

Condition 6 (Rank condition). The n-gram matrix $A^{(n\text{-gram})}$ is full column rank.

Condition 7 (Graph expansion). Let $G(V_h, V_o^{(n)}; A^{(n\text{-gram})})$ denote the bipartite graph with vertex
sets $V_h$ corresponding to the hidden variables (indexing the columns of $A^{(n\text{-gram})}$) and $V_o^{(n)}$
corresponding to the n-th order observed variables (indexing the rows of $A^{(n\text{-gram})}$) and edge matrix
$A^{(n\text{-gram})} \in \mathbb{R}^{|V_o^{(n)}| \times |V_h|}$. The bipartite graph $G(V_h, V_o^{(n)}; A^{(n\text{-gram})})$ satisfies the following expansion
property on the restricted version specified by $A^{(n\text{-gram})}_{\text{Rest.}}$,

$$\left| N_{A^{(n\text{-gram})}_{\text{Rest.}}}(S) \right| \ge |S| + d_{\max}\left(A^{(n\text{-gram})}\right), \quad \forall S \subseteq V_h,\ |S| > \operatorname{krank}(A), \tag{17}$$

where $d_{\max}(A^{(n\text{-gram})})$ is the maximum node degree in set $V_h$.


Remark 8. The expansion condition for the bag-of-words admixture model is provided in (4),
introduced in [36]. The proposed expansion condition in (17) is inherited from (4), with two major
modifications. First, the condition is appropriately generalized for our model which involves a
graph with edges specified by the n-gram matrix, $A^{(n\text{-gram})}$, as stated in (14). Second, the expansion
property (4), proposed in [36], needs to be satisfied for all subsets S with size $|S| \ge 2$, which is a
much stricter condition than the one proposed here in (17), since we can have $\operatorname{krank}(A) \gg 2$. Note
that because of the $d_{\max}$ term in the expansion property in (17), it is hard to satisfy (17) for small
sets.


The deterministic identifiability result for the model described in Section 2, based on the conditions
on $A^{(n\text{-gram})}$, is stated in the following theorem for $n \ge 2$, while the $n = 1$ case is addressed in Remarks 4
and 8. This is actually the basic result from which the main deterministic and random identifiability
results, respectively proposed in Theorems 1 and 2, are concluded. The identifiability result relies on
having the (2n)-th order moment of the observed variables $x_l$, $l \in [2n]$, defined in equation (2) as

$$M_{2n}(x) := \mathbb{E}\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_n)(x_{n+1} \otimes x_{n+2} \otimes \cdots \otimes x_{2n})^\top\right] \in \mathbb{R}^{p^n \times p^n}.$$

Theorem 6 (Generic identifiability under deterministic conditions on $A^{(n\text{-gram})}$). Let $M^{(n)}_{2n}(x)$
(defined in equation (2)) be the (2n)-th order moment of the n-persistent topic model described in
Section 2. If the model satisfies conditions 1, 6 and 7, then, for any $n \ge 2$, all the columns of the
population structure A are generically identifiable from $M^{(n)}_{2n}(x)$.

Proof: Define $B := A^{(n\text{-gram})} \in \mathbb{R}^{p^n \times q}$. Then, the moment characterized in equation (14) can
be written as $M^{(n)}_{2n}(x) = B\,\mathbb{E}[hh^\top]\,B^\top$. Since both matrices $\mathbb{E}[hh^\top]$ and $B$ have full column
rank (from conditions 1 and 6), the rank of $B\,\mathbb{E}[hh^\top]\,B^\top$ is q where $q = O(p^n)$, and furthermore
$\operatorname{Col}(B\,\mathbb{E}[hh^\top]\,B^\top) = \operatorname{Col}(B)$. Let $U := \{u_1, \dots, u_q\} \subset \mathbb{R}^{p^n}$ be any basis of $\operatorname{Col}(B\,\mathbb{E}[hh^\top]\,B^\top)$
satisfying the following two properties:

1) The $u_i$'s have the smallest $\ell_0$ norms.

2) The $u_i$'s have the q smallest (tensor) ranks in the n-th order tensor form, i.e., $U_i := \operatorname{ten}(u_i)$, $i \in [q]$,
have the q smallest ranks.

Let the columns of matrix B be $b_i$ for $i \in [q]$. Since all the $b_i$'s (which belong to $\operatorname{Col}(B\,\mathbb{E}[hh^\top]\,B^\top)$)
are rank-1 in the n-th order tensor form (since $\operatorname{ten}(b_i) = a_i^{\circ n}$) and the number of non-zero entries
in each of the $b_i$'s is at most $d_{\max}(B) = d_{\max}(A)^n$, we conclude that

$$\max_i \operatorname{Rank}(\operatorname{ten}(u_i)) = 1 \quad \text{and} \quad \max_i \|u_i\|_0 \le d_{\max}(B). \tag{18}$$

The above bounds are concluded from the fact that $b_i \in \operatorname{Col}(B\,\mathbb{E}[hh^\top]\,B^\top)$, $i \in [q]$, and therefore
the $\ell_0$ norm and the rank properties of the $b_i$'s are upper bounds for the corresponding properties of the
basis vectors $u_i$ (according to the proposed conditions for the $u_i$'s).

Now, exploiting these observations and also the genericity of A and the expansion condition 7, we
show that the basis vectors $u_i$ are scaled columns of B. Since $u_i$, for $i \in [q]$, is a vector in the
column space of B, it can be represented as $u_i = B v_i$ for some vector $v_i \in \mathbb{R}^q$. Equivalently, for
any $i \in [q]$, $u_i = \sum_{j=1}^{q} v_i(j)\, b_j$ where $b_j = a_j^{\otimes n}$ is the j-th column of matrix B and $v_i(j)$ is a scalar
which is the j-th entry of vector $v_i$. Then, the tensor form of $u_i$ can be written as

$$\operatorname{ten}(u_i) = \sum_{j=1}^{q} v_i(j) \operatorname{ten}(b_j) = \sum_{j=1}^{q} v_i(j) \operatorname{ten}(a_j^{\otimes n}) = \sum_{j=1}^{q} v_i(j)\, a_j^{\circ n} = [\![\, v_i;\ \underbrace{A, A, \dots, A}_{n \text{ times}} \,]\!], \tag{19}$$


where the last equality is based on the Kruskal form notation defined in Definition 8. We define
$\tilde{v}_i := [v_i(j)]_{j : v_i(j) \neq 0}$ as the vector which contains only the non-zero entries of $v_i$, i.e., $\tilde{v}_i$ is the
restriction of vector $v_i$ to its support. Therefore, $\tilde{v}_i \in \mathbb{R}^r$ where $r := \|v_i\|_0$. Furthermore, the
matrix $\tilde{A}_i := \{a_j : v_i(j) \neq 0\} \in \mathbb{R}^{p \times r}$ is defined as the restriction of A to its columns corresponding
to the support of $v_i$. Let $(\tilde{a}_i)_j$ denote the j-th column of $\tilde{A}_i$. According to these definitions,
equation (19) reduces to

$$\operatorname{ten}(u_i) = [\![\, \tilde{v}_i;\ \underbrace{\tilde{A}_i, \tilde{A}_i, \dots, \tilde{A}_i}_{n \text{ times}} \,]\!] = \sum_{j=1}^{r} \tilde{v}_i(j)\, [(\tilde{a}_i)_j]^{\circ n}, \tag{20}$$

which is derived by removing the columns of A corresponding to the zero entries in $v_i$.

Next, we rule out the case that $\|v_i\|_0 \ge 2$ under two cases ($2 \le \|v_i\|_0 \le \operatorname{krank}(A)$ and $\operatorname{krank}(A) <
\|v_i\|_0 \le q$) for the equality $u_i = B v_i$, to conclude that the vectors $u_i$ are scaled columns of B.


Case 1: $2 \le \|v_i\|_0 \le \operatorname{krank}(A)$. Here, the number of columns of $\tilde{A}_i \in \mathbb{R}^{p \times \|v_i\|_0}$ is less than or
equal to krank(A) and therefore it is full column rank. From Fact 1, the rank-1 tensors $[(\tilde{a}_i)_j]^{\circ n}$, $j \in [r]$,
are linearly independent. Hence, for any $n \ge 2$,$^{7}$ from equation (20), we have $\operatorname{Rank}(\operatorname{ten}(u_i)) =
r = \|v_i\|_0 > 1$, which contradicts the fact that $\max_i \operatorname{Rank}(\operatorname{ten}(u_i)) = 1$ in (18).


Case 2: $\operatorname{krank}(A) < \|v_i\|_0 \le q$. Here, we first restrict the n-gram matrix B to distinct rows,
denoted by $B_{\text{Rest.}}$, as defined in Definition 10. Let $u_i' = B_{\text{Rest.}} v_i$. Since $u_i'$ is the restricted version
of $u_i$, we have

$$\|u_i\|_0 \ge \|u_i'\|_0 = \|B_{\text{Rest.}} v_i\|_0 > \left| N_{B_{\text{Rest.}}}(\operatorname{Supp}(v_i)) \right| - |\operatorname{Supp}(v_i)| \ge d_{\max}(B),$$

where the second inequality is from the genericity of A used in Lemma 4, and the third inequality
follows from the graph expansion property (condition 7). This result contradicts the fact that
$\max_i \|u_i\|_0 \le d_{\max}(B)$ in (18).

From the above contradictions, $\|v_i\|_0 = 1$ and hence, the columns of $B := A^{(n\text{-gram})}$ are scaled versions
of the $u_i$'s. □

The following lemma is useful in the proof of Theorem 6. The result proposed in this lemma
is similar to the parameter genericity condition in [36], but generalized for the n-gram matrix,
$A^{(n\text{-gram})}$. The lemma can also be proved along the lines of the proof of Remark 2.2 in [36].

$^{7}$Note that for n = 1, since the (tensor) rank of any vector is 1, this analysis does not work.

Lemma 4. If $A \in \mathbb{R}^{p \times q}$ is generic, then the n-gram matrix $A^{(n\text{-gram})} \in \mathbb{R}^{p^n \times q}$ satisfies the following
property with Lebesgue measure one. For any vector $v \in \mathbb{R}^q$ with $\|v\|_0 \ge 2$, we have

$$\left\| A^{(n\text{-gram})}_{\text{Rest.}}\, v \right\|_0 > \left| N_{A^{(n\text{-gram})}_{\text{Rest.}}}(\operatorname{Supp}(v)) \right| - |\operatorname{Supp}(v)|,$$

where for a set $S \subseteq [q]$, $N_{A^{(n\text{-gram})}}(S) := \{ i \in [p]^n : A^{(n\text{-gram})}(i, j) \neq 0 \text{ for some } j \in S \}$.

Here, we prove the result for the case of n = 2. The proof can be easily generalized to larger n.

Let $A := M + Z$ be generic, where M is an arbitrary matrix, perturbed by random continuous
perturbations Z. Consider the 2-gram matrix $B := A \odot A \in \mathbb{R}^{p^2 \times q}$. It is shown that the restricted
version of B, denoted by $\tilde{B} := B_{\text{Rest.}} \in \mathbb{R}^{\frac{p(p+1)}{2} \times q}$, satisfies the above genericity condition. We first
establish some definitions.

Definition  11.  We  call  a  vector  fully  dense  if  all  of  its  entries  are  non-zero. 

Definition  12.  We  say  a  matrix  has  the  Null  Space  Property  (NSP)  if  its  null  space  does  not 
contain  any  fully  dense  vector. 

Claim 1. Fix any $S \subseteq [q]$ with $|S| \ge 2$, and set $R := N_{A^{(2\text{-gram})}_{\text{Rest.}}}(S)$. Let $\tilde{C}$ be a $|S| \times |S|$ submatrix
of $\tilde{B}_{R,S}$. Then $\Pr(\tilde{C} \text{ has the NSP}) = 1$.

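Proof of Claim 1: First, note that $\tilde{B}$ can be expanded as

$$\tilde{B} := (A \odot A)_{\text{Rest.}} = (M \odot M)_{\text{Rest.}} + \underbrace{(M \odot Z + Z \odot M)_{\text{Rest.}} + (Z \odot Z)_{\text{Rest.}}}_{:= U}.$$

Let $s = |S|$ and let $\tilde{C} = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_s]^\top$, where $\tilde{c}_i^\top$ is the i-th row of $\tilde{C}$. Also, let $C := [c_1 | c_2 | \cdots | c_s]^\top$
and $W := [w_1 | w_2 | \cdots | w_s]^\top$ be the corresponding $|S| \times |S|$ submatrices of $(M \odot M)_{\text{Rest.}}$ and U,
respectively. For each $i \in [s]$, denote by $N_i$ the null space of the matrix $\tilde{C}_i = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_i]^\top$. Finally let
$N_0 = \mathbb{R}^s$. Then, $N_0 \supseteq N_1 \supseteq \cdots \supseteq N_s$. We need to show that, with probability one, $N_s$ does not
contain any fully dense vector.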

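If one of the $N_i$, $i \in [s]$, does not contain any fully dense vector, the result is proved. Suppose that
$N_i$ contains some fully dense vector v. Since $\tilde{C}$ is a submatrix of $\tilde{B}_{R,S}$, every row $\tilde{c}_{i+1}^\top$ of $\tilde{C}$
contains at least one non-zero entry. Therefore,

$$v^\top \tilde{c}_{i+1} = \sum_{j \in [s]} v(j)\, \tilde{c}_{i+1}(j) = \sum_{j \in [s] : \tilde{c}_{i+1}(j) \neq 0} v(j) \left( c_{i+1}(j) + w_{i+1}(j) \right),$$

where $\{ w_{i+1}(j) : j \in [s] \text{ s.t. } \tilde{c}_{i+1}(j) \neq 0 \}$ are independent random variables, and moreover, they are
independent of $\tilde{c}_1, \dots, \tilde{c}_i$ and thus of v. By assumption on the distribution of the $w_{i+1}(j)$,

$$\Pr\left[ v \in N_{i+1} \,\middle|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i \right] = \Pr\left[ \sum_{j \in [s] : \tilde{c}_{i+1}(j) \neq 0} v(j) \left( c_{i+1}(j) + w_{i+1}(j) \right) = 0 \,\middle|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i \right] = 0. \tag{21}$$

Consequently,

$$\Pr\left[ \dim(N_{i+1}) < \dim(N_i) \,\middle|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i \right] = 1 \tag{22}$$

for all $i = 0, \dots, s - 1$. As a result, with probability one, $\dim(N_s) = 0$. □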

Now,  we  are  ready  to  prove  Lemma  4. 

Proof of Lemma 4: It follows from Claim 1 that, with probability one, the following event holds:
for every $S \subseteq [q]$, $|S| \ge 2$, and every $|S| \times |S|$ submatrix $\tilde{C}$ of $\tilde{B}_{R,S}$ where $R := N_{A^{(2\text{-gram})}_{\text{Rest.}}}(S)$,
$\tilde{C}$ has the NSP.

Now fix $v \in \mathbb{R}^q$ with $\|v\|_0 \ge 2$. Let $S := \operatorname{Supp}(v)$ and $H := \tilde{B}_{R,S}$. Furthermore, let $u \in (\mathbb{R} \setminus \{0\})^{|S|}$
be the restriction of vector v to S; observe that u is fully dense. It is clear that $\|\tilde{B} v\|_0 = \|H u\|_0$,
so we need to show that

$$\|H u\|_0 > |R| - |S|. \tag{23}$$

For the sake of contradiction, suppose that $Hu$ has at most $|R| - |S|$ non-zero entries. Since
$Hu \in \mathbb{R}^{|R|}$, there is a subset of $|S|$ entries on which $Hu$ is zero. This corresponds to a $|S| \times |S|$
submatrix of $H := \tilde{B}_{R,S}$ which contains u in its null space. It means that this submatrix does
not have the NSP, which is a contradiction. Therefore we conclude that $Hu$ must have more than
$|R| - |S|$ non-zero entries, which finishes the proof. □
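
The inequality of Lemma 4 can be evaluated directly on a small example for n = 2. The sketch below is only an illustration: the function names `restricted_2gram` and `lemma4_sides`, the tolerance, and the particular 3 x 2 matrix are assumptions of this example. The lemma itself is a probability-one statement for generic A, and this fixed matrix simply happens to satisfy it.

```python
import itertools

import numpy as np


def restricted_2gram(A):
    """Rows indexed by pairs i <= k; the row for (i, k) has entries A[i, j] * A[k, j]."""
    p = A.shape[0]
    pairs = itertools.combinations_with_replacement(range(p), 2)
    return np.array([A[i, :] * A[k, :] for i, k in pairs])


def lemma4_sides(A, v, tol=1e-12):
    """Return (||B_rest v||_0, |N(Supp(v))| - |Supp(v)|) for the 2-gram case."""
    B = restricted_2gram(A)
    support = np.flatnonzero(np.abs(v) > tol)
    # Neighbourhood of the support: rows with a non-zero entry in a supported column.
    neighbours = np.flatnonzero(np.any(np.abs(B[:, support]) > tol, axis=1))
    nnz = int(np.count_nonzero(np.abs(B @ v) > tol))
    return nnz, len(neighbours) - len(support)


A = np.array([[1.0, 0.0],
              [2.0, 3.0],
              [0.0, 5.0]])
print(lemma4_sides(A, np.array([1.0, 1.0])))   # (5, 3)
print(lemma4_sides(A, np.array([9.0, -4.0])))  # (4, 3)
```

The second vector is chosen adversarially so that the row indexed by (1, 1) cancels (9 * 4 - 4 * 9 = 0); the left-hand side drops from 5 to 4 but still strictly exceeds $|R| - |S| = 3$.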

A.2  Proof of moment characterization lemmata

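Proof of Lemma 1: First, in order to simplify the notation, similar to tensor powers for vectors,
the tensor power for a matrix $U \in \mathbb{R}^{p \times q}$ is defined as

$$U^{\otimes r} := \underbrace{U \otimes U \otimes \cdots \otimes U}_{r \text{ times}} \in \mathbb{R}^{p^r \times q^r}. \tag{24}$$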

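First, consider the case m = rn for some integer $r \ge 1$. One advantage of encoding $y_j$, $j \in [2r]$,
by basis vectors appears in characterizing the conditional moments. The first order conditional
moment of words $x_l$, $l \in [2m]$, in the n-persistent topic model can be written as

$$\mathbb{E}[x_{(j-1)n+k} \mid y_j] = A y_j, \quad j \in [2r],\ k \in [n],$$

where $A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$. Next, the m-th order conditional moment of different views
$x_l$, $l \in [m]$, in the n-persistent topic model can be written as

$$\mathbb{E}[x_1 \otimes x_2 \otimes \cdots \otimes x_m \mid y_1 = e_{i_1}, y_2 = e_{i_2}, \dots, y_r = e_{i_r}] = a_{i_1}^{\otimes n} \otimes a_{i_2}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n},$$

which is derived from the conditional independence relationships among the observations $x_l$, $l \in
[m]$, given topics $y_j$, $j \in [r]$. Similar to the first order moments, since vectors $y_j$, $j \in [r]$, are
encoded by the basis vectors $e_i \in \mathbb{R}^q$, the above moment can be written as the following matrix
multiplication

$$\mathbb{E}[x_1 \otimes x_2 \otimes \cdots \otimes x_m \mid y_1, y_2, \dots, y_r] = \left( [A^{(n\text{-gram})}]^{\otimes r} \right) (y_1 \otimes y_2 \otimes \cdots \otimes y_r), \tag{25}$$

where the $(\cdot)^{\otimes r}$ notation is defined in equation (24). Now for the (2m)-th order moment, we
have

$$\begin{aligned}
M^{(n)}_{2m}(x) &:= \mathbb{E}\left[ (x_1 \otimes x_2 \otimes \cdots \otimes x_m)(x_{m+1} \otimes x_{m+2} \otimes \cdots \otimes x_{2m})^\top \right] \\
&= \mathbb{E}_{(y_1, \dots, y_{2r})}\left[ \mathbb{E}\left[ (x_1 \otimes \cdots \otimes x_m)(x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_1, y_2, \dots, y_{2r} \right] \right] \\
&\overset{(a)}{=} \mathbb{E}_{(y_1, \dots, y_{2r})}\left[ \mathbb{E}[(x_1 \otimes \cdots \otimes x_m) \mid y_1, \dots, y_{2r}]\ \mathbb{E}[(x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_1, \dots, y_{2r}] \right] \\
&\overset{(b)}{=} \mathbb{E}_{(y_1, \dots, y_{2r})}\left[ \mathbb{E}[(x_1 \otimes \cdots \otimes x_m) \mid y_1, \dots, y_r]\ \mathbb{E}[(x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_{r+1}, \dots, y_{2r}] \right] \\
&\overset{(c)}{=} \mathbb{E}_{(y_1, \dots, y_{2r})}\left[ \left( [A^{(n\text{-gram})}]^{\otimes r} \right) (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \left( [A^{(n\text{-gram})}]^{\otimes r} \right)^\top \right] \\
&= \left( [A^{(n\text{-gram})}]^{\otimes r} \right) \mathbb{E}\left[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \right] \left( [A^{(n\text{-gram})}]^{\otimes r} \right)^\top \\
&\overset{(d)}{=} \left( [A^{(n\text{-gram})}]^{\otimes r} \right) M_{2r}(y) \left( [A^{(n\text{-gram})}]^{\otimes r} \right)^\top,
\end{aligned} \tag{26}$$

where (a) results from the independence of $(x_1, \dots, x_m)$ and $(x_{m+1}, \dots, x_{2m})$ given $(y_1, y_2, \dots, y_{2r})$
and (b) is concluded from the independence of $(x_1, \dots, x_m)$ and $(y_{r+1}, \dots, y_{2r})$ given $(y_1, \dots, y_r)$
and the independence of $(x_{m+1}, \dots, x_{2m})$ and $(y_1, \dots, y_r)$ given $(y_{r+1}, \dots, y_{2r})$. Equation (25)
is used in (c) and finally, the (2r)-th order moment of $(y_1, \dots, y_{2r})$ is defined as $M_{2r}(y) :=
\mathbb{E}\left[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \right]$ in (d).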

On the other hand, for $M_{2r}(y)$, we have by the law of total expectation

$$\begin{aligned}
M_{2r}(y) &:= \mathbb{E}\left[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \right] \\
&= \mathbb{E}_h\left[ \mathbb{E}\left[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \mid h \right] \right] \\
&= \mathbb{E}_h\Big[ (\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}) (\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}})^\top \Big] \\
&= M_{2r}(h),
\end{aligned}$$

where the third equality is concluded from the conditional independence of the variables $y_j$, $j \in [2r]$,
given h and the model assumption that $\mathbb{E}[y_j \mid h] = h$, $j \in [2r]$. Substituting this in equation (26)
finishes the proof for the n-persistent topic model. Similarly, the moment of the single topic model
(infinite persistence) can also be derived. □

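Proof of Lemma 2: Defining $\Lambda := M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ and $B := [A^{(n\text{-gram})}]^{\otimes r} \in \mathbb{R}^{p^{rn} \times q^r}$, the (2rn)-th
order moment $M^{(n)}_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$ of the n-persistent topic model proposed in equation (5) can
be written as

$$M^{(n)}_{2rn}(x) = B \Lambda B^\top.$$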


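Let $b_{(i_1, \dots, i_r)} \in \mathbb{R}^{p^{rn}}$ denote the corresponding column of B indexed by the r-tuple $(i_1, \dots, i_r)$, $i_k \in
[q]$, $k \in [r]$. Then, the above matrix equation can be expanded as

$$\begin{aligned}
M^{(n)}_{2rn}(x) &= \sum_{\substack{i_1, \dots, i_r \in [q] \\ j_1, \dots, j_r \in [q]}} \Lambda\big( (i_1, \dots, i_r), (j_1, \dots, j_r) \big)\ b_{(i_1, \dots, i_r)}\, b_{(j_1, \dots, j_r)}^\top \\
&= \sum_{\substack{i_1, \dots, i_r \in [q] \\ j_1, \dots, j_r \in [q]}} \Lambda\big( (i_1, \dots, i_r), (j_1, \dots, j_r) \big) \left[ a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n} \right] \left[ a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n} \right]^\top,
\end{aligned}$$

where the relation $b_{(i_1, \dots, i_r)} = a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}$, $i_1, \dots, i_r \in [q]$, is used in the last equality. Let $m^{(n)}_{2rn}(x) \in
\mathbb{R}^{p^{2rn}}$ denote the vectorized form of the (2rn)-th order moment $M^{(n)}_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$. Therefore, we
have

$$m^{(n)}_{2rn}(x) := \operatorname{vec}\left( M^{(n)}_{2rn}(x) \right) = \sum_{\substack{i_1, \dots, i_r \in [q] \\ j_1, \dots, j_r \in [q]}} \Lambda\big( (i_1, \dots, i_r), (j_1, \dots, j_r) \big)\ a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n} \otimes a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}.$$

Then, we have the following equivalent tensor form for the original model proposed in equation
(5)

$$T^{(n)}_{2rn}(x) := \operatorname{ten}\left( m^{(n)}_{2rn}(x) \right) = \sum_{\substack{i_1, \dots, i_r \in [q] \\ j_1, \dots, j_r \in [q]}} \Lambda\big( (i_1, \dots, i_r), (j_1, \dots, j_r) \big)\ a_{j_1}^{\circ n} \circ \cdots \circ a_{j_r}^{\circ n} \circ a_{i_1}^{\circ n} \circ \cdots \circ a_{i_r}^{\circ n}.$$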

□ 


A.3  Proof of generalized matching properties

Proof of Lemma 3: We show that if $G(Y, X; A)$ has a perfect n-gram matching, then $G(Y, X^{(n)}; A^{(n\text{-gram})})$
has a perfect matching. The reverse can also be immediately shown by reversing the discussion
and exploiting the additional condition stated in the lemma.

Let $E^{(n\text{-gram})}$ denote the edge set of the bipartite graph $G(Y, X^{(n)}; A^{(n\text{-gram})})$. Assume $G(Y, X; A)$
has a perfect n-gram matching $M \subseteq E$. For any $j \in Y$, let $N_M(j)$ denote the set of neighbors
of vertex j according to edge set M. Since M is a perfect n-gram matching, $|N_M(j)| = n$
for all $j \in Y$. It can be immediately concluded from Definition 3 that the sets $N_M(j)$ are all
distinct, i.e., $N_M(j_1) \neq N_M(j_2)$ for any $j_1, j_2 \in Y$, $j_1 \neq j_2$. For any $j \in Y$, let $N'_M(j)$ denote
an arbitrary ordered n-tuple generated from the elements of set $N_M(j)$. From the definition of the
n-gram matrix, we have $A^{(n\text{-gram})}(N'_M(j), j) \neq 0$ for all $j \in Y$. Hence, $(j, N'_M(j)) \in E^{(n\text{-gram})}$
for all $j \in Y$, which, together with the fact that all the tuples $N'_M(j)$ are distinct, implies that
$M^{(n\text{-gram})} := \{ (j, N'_M(j)) \mid j \in Y \} \subseteq E^{(n\text{-gram})}$ is a perfect matching for $G(Y, X^{(n)}; A^{(n\text{-gram})})$.

□ 
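
The forward direction of this proof is constructive and can be sketched in a few lines. The code below is a minimal illustration, not the paper's procedure: the encoding of M as a dictionary, the function name `lift_to_ngram_matching`, and the choice of the sorted tuple as the "arbitrary ordered n-tuple" are assumptions made here. It checks only the two properties of a perfect n-gram matching that the proof invokes (exactly n matched neighbours per node of Y, and pairwise distinct neighbour sets).

```python
import numpy as np


def lift_to_ngram_matching(A, M, n):
    """Map a perfect n-gram matching of G(Y, X; A) to a matching of the n-gram graph.

    M is a dict {j: set of n row indices matched to column j} (hypothetical encoding).
    Returns {j: ordered n-tuple N'_M(j)}; raises ValueError if a property
    used in the proof fails.
    """
    lifted, used = {}, set()
    for j, nbrs in M.items():
        if len(nbrs) != n:
            raise ValueError(f"node {j} must have exactly {n} matched neighbours")
        if any(A[i, j] == 0 for i in nbrs):
            raise ValueError(f"a matching edge at node {j} is not an edge of the graph")
        ordered = tuple(sorted(nbrs))  # one arbitrary ordering of N_M(j)
        if ordered in used:
            raise ValueError("neighbour sets N_M(j) must be pairwise distinct")
        used.add(ordered)
        # The n-gram entry is the product of n non-zero entries of column j.
        assert np.prod([A[i, j] for i in ordered]) != 0
        lifted[j] = ordered
    return lifted


A = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 4.0],
              [0.0, 5.0, 6.0]])
M = {0: {0, 1}, 1: {0, 2}, 2: {1, 2}}
print(lift_to_ngram_matching(A, M, n=2))  # {0: (0, 1), 1: (0, 2), 2: (1, 2)}
```

Each node of Y is matched to a distinct tuple, i.e. a distinct row of the n-gram matrix, which is exactly a perfect matching in the lifted graph.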

Proof of Remark 3: In order to show this, we fix the size of vertex set X to p and see what
the maximum number of vertices in set Y could be such that the resulting bipartite graph still has
a perfect n-gram matching. Therefore, assume we have p vertices in X and an empty vertex set
Y on the other side. We want to introduce vertices in set Y with degree n such that the resulting
bipartite graph has a perfect n-gram matching. In order to satisfy this property, for any subset of
vertices $S \subseteq X$ with $|S| = n$, we introduce a new vertex in set Y, connected to the vertices in S, to
ensure that the graph has a perfect n-gram matching. Hence, we can introduce up to $\binom{p}{n}$ vertices in Y. □
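
The greedy construction in this proof can be written out as follows. This is a small sketch under the reading that each new vertex of Y is connected exactly to the vertices of its n-subset; the function name `greedy_ngram_graph` and the 0/1 bi-adjacency encoding are assumptions of the illustration.

```python
import itertools
from math import comb

import numpy as np


def greedy_ngram_graph(p, n):
    """Bi-adjacency matrix (p x C(p, n)): one degree-n node in Y per n-subset of X."""
    subsets = list(itertools.combinations(range(p), n))
    A = np.zeros((p, len(subsets)), dtype=int)
    for j, S in enumerate(subsets):
        A[list(S), j] = 1  # node j in Y is connected exactly to the subset S
    return A, subsets


A, subsets = greedy_ngram_graph(p=5, n=2)
print(A.shape, comb(5, 2))  # (5, 10) 10
```

Every column has exactly n non-zero entries and no two columns share the same support, so taking all edges of the graph as the matching gives each node of Y its own n-subset.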

A.4  Sufficient matching properties for satisfying rank and graph expansion conditions

In the following lemma, it is shown that under having a perfect n-gram matching and additional
genericity and krank conditions, the rank and graph expansion conditions 6 and 7 on $A^{(n\text{-gram})}$ are
satisfied.

Lemma 5. Assume that the bipartite graph $G(V_h, V_o; A)$ has a perfect n-gram matching (condition
2 is satisfied). Then, the following results hold for the n-gram matrix $A^{(n\text{-gram})}$:

1) If A is generic, $A^{(n\text{-gram})}$ is full column rank (condition 6) with Lebesgue measure one (almost
surely).

2) If krank condition 3 holds, $A^{(n\text{-gram})}$ satisfies the proposed expansion property in condition 7.

Proof: Let M indicate the perfect n-gram matching of the bipartite graph $G(V_h, V_o; A)$. From
Lemma 3, there exists a perfect matching $M^{(n\text{-gram})}$ for the bipartite graph $G(V_h, V_o^{(n)}; A^{(n\text{-gram})})$.
Denote the bi-adjacency matrix corresponding to the edge set M as $A_M$. Similarly, $B_M$ denotes
the bi-adjacency matrix corresponding to the edge set $M^{(n\text{-gram})}$. Note that $\operatorname{Supp}(A_M) \subseteq \operatorname{Supp}(A)$
and $\operatorname{Supp}(B_M) \subseteq \operatorname{Supp}(A^{(n\text{-gram})})$.

Since $B_M$ corresponds to a perfect matching, it contains $q := |V_h|$ rows, each of which has only one non-zero
entry, and furthermore, the non-zero entries are in q different columns. Therefore, these rows form
q linearly independent vectors. Since the row rank and column rank of a matrix are equal, and the
number of columns of $B_M$ is q, the column rank of $B_M$ is q or in other words, $B_M$ is full column
rank. Since A is generic, from Lemma 6 (with a slight modification in the analysis$^{8}$), $A^{(n\text{-gram})}$ is
also full column rank with Lebesgue measure one (almost surely). This completes the proof of part
1.

$^{8}$The result of Lemma 6 is about the column rank of A itself, but here it is about the column rank of $A^{(n\text{-gram})}$, for which
the same analysis works. Note that the support of $B_M$ (which is full column rank here) is within the support of
$A^{(n\text{-gram})}$ and therefore Lemma 6 can still be applied.

Next, the second part is proved. From the definition of krank, we have

$$|N_A(S')| \ge |S'| \quad \text{for } S' \subseteq V_h,\ |S'| \le \operatorname{krank}(A),$$

which is concluded from the fact that the corresponding submatrix of A specified by $S'$ should be
full column rank. From this inequality, we have

$$|N_A(S')| \ge \operatorname{krank}(A) \quad \text{for } S' \subseteq V_h,\ |S'| = \operatorname{krank}(A). \tag{27}$$

Then, we have

$$\begin{aligned}
|N_A(S)| &\ge |N_A(S')| \quad \text{for } S' \subset S \subseteq V_h,\ |S| > \operatorname{krank}(A),\ |S'| = \operatorname{krank}(A), \\
&\ge \operatorname{krank}(A) \\
&\ge d_{\max}(A)^n,
\end{aligned} \tag{28}$$

where (27) is used in the second inequality and the last inequality is from krank condition 3.

In the restricted n-gram matrix $A^{(n\text{-gram})}_{\text{Rest.}}$, the number of neighbors for a set $S \subseteq V_h$, $|S| > \operatorname{krank}(A)$,
can be bounded as

$$\begin{aligned}
\left| N_{A^{(n\text{-gram})}_{\text{Rest.}}}(S) \right| &\ge |N_A(S)| + |S| \\
&\ge |S| + d_{\max}(A)^n \quad \text{for } |S| > \operatorname{krank}(A),
\end{aligned}$$

where the first inequality is concluded from the existence of a perfect n-gram matching in A, and
the bound (28) is used in the second inequality. Since $d_{\max}(A^{(n\text{-gram})}) = d_{\max}(A)^n$, the proof of
part 2 is also completed.

□ 

Remark 9. The second result of the above lemma is similar to the necessity argument of (Hall's)
Theorem 7 for the existence of a perfect matching in a bipartite graph, but generalized to the case of
perfect n-gram matching and with an additional krank condition, which is expected since the expansion
condition proposed here is stricter than the one in Hall's theorem.

A.5  (Auxiliary) lemmata and facts

Lemma 6. Consider a matrix $C \in \mathbb{R}^{m \times r}$ which is generic. Let $\tilde{C} \in \mathbb{R}^{m \times r}$ be such that $\operatorname{Supp}(\tilde{C}) \subseteq
\operatorname{Supp}(C)$ and the non-zero entries of $\tilde{C}$ are the same as the corresponding non-zero entries of C.
If $\tilde{C}$ is full column rank, then C is also full column rank, almost surely.

Proof: Since $\tilde{C}$ is full column rank, there exists an $r \times r$ submatrix of $\tilde{C}$, denoted by $\tilde{C}_S$, with
non-zero determinant, i.e., $\det(\tilde{C}_S) \neq 0$. Let $C_S$ denote the corresponding submatrix of C indexed
by the same rows and columns as $\tilde{C}_S$.

The determinant of $C_S$ is a polynomial in the entries of $C_S$. Since $\tilde{C}_S$ can be derived from $C_S$ by
keeping the corresponding non-zero entries, $\det(C_S)$ can be decomposed into two terms as

$$\det(C_S) = \det(\tilde{C}_S) + f(C_S),$$

where the first term corresponds to the monomials for which all the variables (entries of $C_S$) are
also in $\tilde{C}_S$ and the second term corresponds to the monomials for which at least one variable is
not in $\tilde{C}_S$. The first term is non-zero as stated earlier. Since C is generic, the polynomial $f(C_S)$ is
non-trivial and therefore its roots have Lebesgue measure zero. It implies that $\det(C_S) \neq 0$ with
Lebesgue measure one (almost surely), and hence, it is full (column) rank. Thus, C is also full
column rank, almost surely. □

Fact  1.  If  vectors  wt  G  G  [r]  are  linearly  independent,  then  the  n-th  order  tensor  powers 

Wi  :=  w°n  G  G  [r],  are  also  linearly  independent. 
Proof: For the sake of contradiction, suppose the tensors $W_i$, $i \in [r]$, are linearly dependent. Then there exist coefficients $\alpha_i \in \mathbb{R}$, $i \in [r]$, not all zero, such that

$$\sum_{i=1}^{r} \alpha_i W_i = 0. \tag{29}$$

Without loss of generality, assume that $\alpha_1 \neq 0$. Since the $w_i$'s are linearly independent, we have $w_i \neq 0$ for all $i \in [r]$, and therefore there exists some $k \in [p]$ such that $w_1(k) \neq 0$.

Consider the fibers $f_i := W_i(1:p, k, k, \dots, k)$, $i \in [r]$, i.e., the vectors obtained by fixing all but the first index of the $W_i$'s. From the tensor outer product definition in (8), we have $f_i = \beta_i w_i$, where $\beta_i := w_i(k)^{n-1} \in \mathbb{R}$ for all $i \in [r]$. Furthermore, by the choice of $k$ made above, we have $\beta_1 \neq 0$.

Restricting equality (29) to the subset indexed by $(1:p, k, k, \dots, k)$ results in

$$\sum_{i=1}^{r} \alpha_i f_i = \sum_{i=1}^{r} \gamma_i w_i = 0,$$

where $\gamma_i := \alpha_i \beta_i$. Since at least one of the scalar coefficients $\gamma_i$, $i \in [r]$, is non-zero ($\gamma_1 = \alpha_1 \beta_1 \neq 0$), the above equality implies that the vectors $w_i$, $i \in [r]$, are linearly dependent. This contradicts the assumption of Fact 1 and completes the proof. □
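The statement of Fact 1 and the fiber identity used in its proof can be checked numerically on a small instance. The following is a minimal sketch, assuming three hand-picked integer vectors in $\mathbb{R}^4$ and $n = 3$; the helper names `rank` and `tensor_power` are illustrative and not part of the paper.

```python
from fractions import Fraction
from itertools import product

def rank(rows):
    """Exact rank of a matrix (list of rows) via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            if factor != 0:
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def tensor_power(w, n):
    """Flattened n-th order tensor power: entry (j1,...,jn) equals w[j1]*...*w[jn]."""
    out = []
    for idx in product(range(len(w)), repeat=n):
        entry = 1
        for j in idx:
            entry *= w[j]
        out.append(entry)
    return out

# Three linearly independent vectors in R^4 (illustrative values).
vectors = [[1, 2, 0, 1], [0, 1, 1, 3], [2, 0, 1, 1]]
n = 3
powers = [tensor_power(w, n) for w in vectors]

rank_vectors = rank(vectors)  # rank of the w_i
rank_powers = rank(powers)    # rank of the flattened tensor powers W_i

# Fiber from the proof: fix all but the first index at k (0-based here).
p, k = len(vectors[0]), 0
fiber = [powers[0][j * p**2 + k * p + k] for j in range(p)]  # W_1(1:p, k, k)
beta = vectors[0][k] ** (n - 1)                               # beta_1 = w_1(k)^(n-1)
```

The fiber of $W_1$ is a scalar multiple of $w_1$, which is the step that transfers a dependence among the $W_i$ back to the $w_i$.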

Finally,  Theorem  1  is  proved  by  combining  the  results  of  Theorem  6  and  Lemma  5. 

Proof of Theorem 1: Since conditions 2 and 3 hold and $A$ is generic, Lemma 5 can be applied, which implies that rank condition 6 is satisfied almost surely and that expansion condition 7 also holds. Therefore, all the conditions required for Theorem 6 are satisfied almost surely, and this completes the proof. □


B  Proof  of  Random  Identifiability  Result  (Theorem  2) 

According  to  the  proof  sketch  provided  in  Section  5.1,  the  steps  for  the  proof  of  Theorem  2  are 
provided  in  the  following  subsections. 

B.1 Proof of existence of perfect n-gram matching and Kruskal results

Proof of Theorem 4: Define $J := c\frac{p}{n}$. Divide set $X$ uniformly at random into $n$ partitions of (almost) equal size,$^9$ denoted by $X_l$, $l \in [n]$. Define sets $X^{(l)} := \bigcup_{i=1}^{l} X_i$, $l \in [n]$. Furthermore, divide set $Y$ uniformly at random into $O(p^{n-1})$ partitions of size at most $J = c\frac{p}{n}$. Applying Lemma 8 and Theorem 3, whp, there exists a perfect matching from each of these partitions of $Y$ to set $X^{(1)}$. Then, we combine every $J$ of these partitions (on the $Y$ side), creating $O(p^{n-2})$ new bipartite graphs. Lemma 7 can then be applied, which implies that whp there exists a perfect 2-gram matching from each of these combined partitions of $Y$ (with size at most $J^2 = (c\frac{p}{n})^2 = O(p^2)$) to set $X^{(2)}$. This combining procedure is performed iteratively; in step $l$, every $J$ partitions on the $Y$ side are combined to create $O(p^{n-l})$ new bipartite graphs. Applying Lemma 7, whp, there exists a perfect $l$-gram matching from the corresponding partitions of $Y$ (with size at most $J^l = (c\frac{p}{n})^l = O(p^l)$) to set $X^{(l)}$. Finally, at iteration $n$, whp, we have a perfect $n$-gram matching from $Y$ (with size at most $J^n = (c\frac{p}{n})^n = O(p^n)$) to set $X^{(n)} = X$.

$^9$ By almost, we mean that the maximum difference in the size of the partitions is 1, which is always possible.

The above discussion is the main part of the proof. To complete it, we need to bound the total number of times that the perfect matching result of Theorem 3 is used in the above random construction, so as to ensure that the high-probability guarantee of Theorem 3 still holds after repeated application. Let $N^{(hp)}$ denote the total number of times that the perfect matching result of Theorem 3 is used in the above construction. Similarly, let $N_l^{(hp)}$, $l \in \{2, \dots, n\}$, denote the number of times that this result is used in step $l$ to ensure that there exists a perfect $l$-gram matching from the corresponding partitions of $Y$ to set $X^{(l)}$, whp. (Note that this is done through Lemma 7, as explained in the above discussion.)

As mentioned in Lemma 7, let $P_{l-1}\big(X^{(l-1)}\big)$ denote the set of all subsets of $X^{(l-1)}$ with cardinality $l-1$, which has size

$$\left|P_{l-1}\big(X^{(l-1)}\big)\right| = \binom{|X^{(l-1)}|}{l-1} = \binom{\frac{l-1}{n}p}{l-1}.$$

According to the construction of the $l$-gram matching proposed in Lemma 7, $\left|P_{l-1}\big(X^{(l-1)}\big)\right|$ is the number of times Theorem 3 is used in order to ensure that there exists a perfect $l$-gram matching for each partition on the $Y$ side. Since at most $J^{n-l}$ such $l$-gram matchings are proposed in step $l$, the number $N_l^{(hp)}$ can be bounded as

$$N_l^{(hp)} \le J^{n-l}\left|P_{l-1}\big(X^{(l-1)}\big)\right| = J^{n-l}\binom{\frac{l-1}{n}p}{l-1}. \tag{30}$$


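Since $J^{n-1}$ perfect matchings need to exist in the first step of the above discussion, we have

$$
\begin{aligned}
N^{(hp)} &= J^{n-1} + \sum_{l=2}^{n} N_l^{(hp)} \\
&\le J^{n-1} + \sum_{l=2}^{n} J^{n-l}\binom{\frac{l-1}{n}p}{l-1} \\
&\le \left(c\frac{p}{n}\right)^{n-1} + \sum_{l=2}^{n} \left(c\frac{p}{n}\right)^{n-l}\left(e\frac{p}{n}\right)^{l-1} \\
&\le n\left(e\frac{p}{n}\right)^{n-1},
\end{aligned}
$$

where inequality (30) is used in the first inequality, and $J := c\frac{p}{n}$ together with the inequality $\binom{n}{k} \le \left(\frac{en}{k}\right)^k$ is used in the second inequality. Therefore, $N^{(hp)} = O(p^{n-1})$, and Theorem 3 is used a polynomial number of times. □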
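The counting argument can be sanity-checked numerically. The sketch below evaluates the sum $J^{n-1} + \sum_{l=2}^{n} J^{n-l}\binom{(l-1)p/n}{l-1}$ with $J = cp/n$ and compares it with the closed-form bound $n(ep/n)^{n-1}$. It assumes, for illustration only, that $n$ divides $p$ and uses the hypothetical values $n = 3$ and $c = 0.5$; the function names are not from the paper.

```python
from math import comb, e

def count_theorem3_uses(p, n, c):
    """Evaluate J^(n-1) + sum_{l=2}^{n} J^(n-l) * C((l-1)p/n, l-1) with J = c*p/n.

    Assumes n divides p so that (l-1)p/n is an integer.
    """
    J = c * p / n
    total = J ** (n - 1)
    for l in range(2, n + 1):
        total += J ** (n - l) * comb((l - 1) * p // n, l - 1)
    return total

def closed_form_bound(p, n):
    """The final bound n * (e*p/n)^(n-1) from the chain of inequalities."""
    return n * (e * p / n) ** (n - 1)

for p in (30, 300, 3000):
    print(p, count_theorem3_uses(p, n=3, c=0.5), closed_form_bound(p, n=3))
```

For fixed $n$, both quantities grow like $p^{n-1}$, which is the polynomial growth the proof relies on.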
Proof of Theorem 5: Let $G(Y, X; A)$ denote the bipartite graph corresponding to matrix $A$, where node sets $Y = [q]$ and $X = [p]$ index the columns and rows of $A$, respectively. Therefore, $|Y| = q$ and $|X| = p$.

Fix some $S \subseteq Y$. Then


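$$\Pr(|N(S)| < |S|) \le \sum_{T \subseteq X:\, |T| = |S|} \Pr(N(S) \subseteq T) \le \binom{p}{|S|}\left(\frac{|S|}{p}\right)^{d|S|},$$

where the bound $\binom{|S|}{d}\big/\binom{p}{d} \le \left(\frac{|S|}{p}\right)^{d}$ is used in the last inequality.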

Let $\mathcal{E}$ denote the event that for any subset $S \subseteq Y$ with $|S| \le r$, we have $|N(S)| \ge |S|$, i.e.,

$$\mathcal{E} := \text{``}\forall S \subseteq Y \wedge 1 \le |S| \le r : |N(S)| \ge |S|\text{''}.$$

Then 

$\Pr(\mathcal{E}^c) = \Pr(\exists S \subseteq Y \text{ s.t. } 1 \le |S| \le r \wedge |N(S)| < |S|) \le$

< 

< 


where the bound $\binom{n}{k} \le \left(\frac{en}{k}\right)^k$ is used in the second inequality.
For  r  :=  ,  the  above  inequality  reduces  to 


where  the  degree  condition  assumed  in  the  theorem  is  used  in  the  second  inequality  and  the  size 
condition is exploited in the last inequality by defining d := (^)n. Since $c'$, $\beta$ and $n$ are constants and $\beta - n + 1 > 0$ by the theorem assumption, it is concluded that


$$\lim_{p \to \infty} \Pr(\mathcal{E}^c) = 0,$$

which implies that event $\mathcal{E}$ happens whp. Therefore, Lemma 9 can be applied, concluding that
krank(yl)  >  r  =  ^p,  whp.  □ 

Proof of Remark 1: Consider a random bipartite graph $G(Y, X; E)$ where for each node $i \in Y$:

1. Neighbors $N(i) \subseteq X$ are picked uniformly at random among all size-$d$ subsets of $X$.

2. Matching $M(i) \subseteq N(i)$ is picked uniformly at random among all size-$n$ subsets of $N(i)$.

Note that as long as $n \le d$, the distribution of $M(i)$ is uniform over all size-$n$ subsets of $X$. Fix some pair $i, i' \in Y$. Then

$$\Pr\big(M(i) = M(i')\big) = \binom{|X|}{n}^{-1}.$$

By the union bound,

$$\Pr\big(\exists\, i, i' \in Y,\ i \neq i' \text{ s.t. } M(i) = M(i')\big) \le \binom{|Y|}{2}\binom{|X|}{n}^{-1},$$

which is $O(|Y|^2/|X|^n)$ when $n$ is constant. Therefore, if $d \ge n$ and the size constraint $|Y| = O(|X|^s)$ for some $s < \frac{n}{2}$ is satisfied, then whp, there is no pair of nodes in set $Y$ with the same random $n$-gram matching. This implies that the random bipartite graph has a perfect $n$-gram matching whp, under these size and degree conditions.

□
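The two-stage sampling in the proof of Remark 1 and the resulting union bound can be sketched as follows. This is a minimal illustration, assuming that a collection of matchings $M(i)$ is a perfect $n$-gram matching when no two nodes of $Y$ receive the same size-$n$ subset, as the proof suggests; the function names and the parameter values ($|Y| = 50$, $|X| = 200$, $d = 5$, $n = 3$) are hypothetical.

```python
import random
from math import comb

def collision_bound(q, p, n):
    """Union bound C(q,2)/C(p,n) on two nodes of Y picking the same n-subset of X."""
    return comb(q, 2) / comb(p, n)

def sample_matchings(q, p, d, n, rng):
    """For each of q nodes: pick d random neighbors in X = [p], then n of them as M(i)."""
    matchings = []
    for _ in range(q):
        neighbors = rng.sample(range(p), d)
        matchings.append(frozenset(rng.sample(neighbors, n)))
    return matchings

def all_distinct(matchings):
    """True when no two nodes of Y were assigned the same n-subset of X."""
    return len(set(matchings)) == len(matchings)

rng = random.Random(0)
q, p, d, n = 50, 200, 5, 3
matchings = sample_matchings(q, p, d, n, rng)
bound = collision_bound(q, p, n)
print(all_distinct(matchings), bound)
```

With $|Y| = O(|X|^s)$ and $s < n/2$, the value returned by `collision_bound` tends to zero as $|X|$ grows, matching the rate stated in the proof.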


B.2  (Auxiliary)  lemmata 

Lemma 7. Consider a bipartite graph $G(Y, X; E)$ with $|Y| = r$ and $|X| = s$, where $r = O(s^l)$ and each node $i \in Y$ is randomly connected to $d_l$ different nodes in set $X$. Divide the nodes in set $Y$ uniformly at random into $J_l := c\frac{s}{l}$ partitions $Y_1, \dots, Y_{J_l}$ of (almost) equal size, for some constant $c < 1$. In addition, divide the nodes in set $X$ uniformly at random into two partitions $X_1$ and $X_2$ with sizes $|X_1| = \frac{l-1}{l}s$ and $|X_2| = \frac{s}{l}$. Next, create $J_l$ different bipartite graphs $G_i(Y_i, X_1; E_i)$, $i \in [J_l]$, by considering partitions $Y_i$ and $X_1$ and the corresponding subset of edges $E_i \subseteq E$ incident to them. Refer to Figure 4a. Furthermore, assume that $d_l \ge \beta \log s$ for some $\beta > l^2/2$. Then, if each of the corresponding $J_l$ bipartite graphs $G_i(Y_i, X_1; E_i)$, $i \in [J_l]$, has a perfect $(l-1)$-gram matching, then whp, the original bipartite graph $G(Y, X; E)$ has a perfect $l$-gram matching.

Proof: Let us denote the corresponding perfect $(l-1)$-gram matching of $G_i(Y_i, X_1; E_i)$ by $M_i$.

Furthermore, the set of all subsets of $X_1$ with cardinality $l-1$ is denoted by $P_{l-1}(X_1)$, i.e., $P_{l-1}(X_1)$ includes the sets with $(l-1)$ elements in the power set$^{10}$ of $X_1$. For each set $S \in P_{l-1}(X_1)$, take the set of all nodes in $Y$ which are connected to all members of $S$ according to the union of matchings $\bigcup_{i=1}^{J_l} M_i$. Call this set the parents of $S$, denoted by $\mathrm{Pa}(S)$. According to the definition of a perfect $(l-1)$-gram matching, there is at most one node in each set $Y_i$ which is connected to all members of $S$ through the matching $M_i$, and therefore $|\mathrm{Pa}(S)| \le J_l = c\frac{s}{l}$. In addition, note that the sets $\mathrm{Pa}(S)$ impose a partitioning on set $Y$, i.e., each node $j \in Y$ is included in exactly one set $\mathrm{Pa}(S)$ for some $S \in P_{l-1}(X_1)$. This is because of the perfect $(l-1)$-gram matchings considered for sets $Y_i$, $i \in [J_l]$.
Now, a perfect $l$-gram matching for the original bipartite graph is constructed as follows. For any $S \in P_{l-1}(X_1)$, consider the set of parents $\mathrm{Pa}(S)$. Create the bipartite graph $G(\mathrm{Pa}(S), X_2; E_S)$, where $E_S \subseteq E$ is the subset of edges incident to partitions $\mathrm{Pa}(S) \subseteq Y$ and $X_2 \subseteq X$. Denote by

10The  power  set  of  any  set  S  is  the  set  of  all  subsets  of  S. 
Figure 4: Auxiliary figures for proof of Lemma 7. (a) Original partitioning of sets $Y$ and $X$ proposed in the lemma, where set $Y$ is partitioned into $J_l := c\frac{s}{l}$ partitions $Y_1, \dots, Y_{J_l}$ of (almost) equal size for some constant $c < 1$. In addition, set $X$ is partitioned into two partitions $X_1$ and $X_2$ with sizes $|X_1| = \frac{l-1}{l}s$ and $|X_2| = \frac{s}{l}$. The bipartite graphs $G_i(Y_i, X_1; E_i)$, $i \in [J_l]$, are also shown in the figure. (b) Partitioning of set $Y$ into subsets $\mathrm{Pa}(S)$, $S \in P_{l-1}(X_1)$, generated through the perfect $(l-1)$-gram matchings $M_i$, $i \in [J_l]$. $S_1$, $S_2$ and $S_3$ are three different sets in $P_{l-1}(X_1)$ shown as samples. In addition, the perfect matchings from $\mathrm{Pa}(S)$, $S \in P_{l-1}(X_1)$, to $X_2$ proposed in the proof are also indicated in the figure.

$d_S$ the minimum degree of nodes in set $\mathrm{Pa}(S)$ in the bipartite graph $G(\mathrm{Pa}(S), X_2; E_S)$. Applying Lemma 8, it is concluded that$^{11}$

$$\Pr[d_S \ge 3] \ge 1 - J_l \exp\left(-\frac{2}{l^2}\,\frac{(d_l - 2l)^2}{d_l}\right).$$

Furthermore, we have $|\mathrm{Pa}(S)| \le c\frac{s}{l} = c|X_2|$. Now, we can apply Theorem 3, concluding that whp there exists a perfect matching from $\mathrm{Pa}(S)$ to $X_2$ within the bipartite graph $G(\mathrm{Pa}(S), X_2; E_S)$. Refer to Figure 4b for a schematic picture. The edges of this perfect matching are combined with the corresponding edges of the existing perfect $(l-1)$-gram matchings $M_i$, $i \in [J_l]$, to provide $l$ incident edges to each node $i \in \mathrm{Pa}(S)$. It is easy to see that this yields a perfect $l$-gram matching from $\mathrm{Pa}(S)$ to $X$.

We perform the same steps for all sets $S \in P_{l-1}(X_1)$ to obtain a perfect $l$-gram matching from any $\mathrm{Pa}(S)$ to $X$. Finally, according to the construction, the union of all of these matchings is a perfect $l$-gram matching from $\bigcup_{S \in P_{l-1}(X_1)} \mathrm{Pa}(S) = Y$ to $X$, and the result is proved. □

Lemma 8 (Degree concentration bound). Consider a random bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$, where each node $i \in Y$ is randomly connected to $d$ different nodes in set $X$. Let $Y' \subseteq Y$ be any subset$^{12}$ of nodes in $Y$ with size $|Y'| = q'$, and let $X' \subseteq X$ be a random (uniformly chosen) subset of nodes in $X$ with size $p' := |X'| = p/n$. Create the new bipartite graph $G(Y', X'; E')$, where edge set $E' \subseteq E$ is the subset of edges in $E$ incident to $Y'$ and $X'$. Denote the degree of each node $i \in Y'$ within this new bipartite graph by $d'_i$. Define $d' := \min_{i \in Y'} d'_i$. Then, if

$^{11}$ Note that in the context of Theorem 4, this bound can be written as $\Pr[d_S \ge 3] \ge 1 - J_l \exp\left(-\frac{2}{n^2}\,\frac{(d - 2n)^2}{d}\right)$. This follows from the fact that when this lemma is applied in Theorem 4, we have $s = \frac{l}{n}p$ and therefore $|X_2| = \frac{p}{n}$.

$^{12}$ Note that $Y'$ need not be uniformly chosen; the result is valid for any subset of nodes $Y' \subseteq Y$.
$d > rn$ for a non-negative integer $r$, we have

$$\Pr[d' \ge r + 1] \ge 1 - q' \exp\left(-\frac{2}{n^2}\,\frac{(d - rn)^2}{d}\right).$$


Proof: For any $i \in Y'$, we have

$$\Pr[d'_i \le r] = \sum_{j=0}^{r} \frac{\binom{p'}{j}\binom{p - p'}{d - j}}{\binom{p}{d}},$$

where the inner term of the summation is a hypergeometric distribution with parameters $p$ (population size), $p'$ (number of success states in the population) and $d$ (number of draws), and $j$ is the hypergeometric random variable denoting the number of successes. The following tail bound for the hypergeometric distribution is provided in [41,42]:

$$\Pr[d'_i \le r] \le \exp(-2t^2 d),$$

for $t > 0$ given by $r = \left(\frac{1}{n} - t\right)d$. Note that the assumption $d > rn$ in the lemma is equivalent to having $t > 0$. Substituting $t$ from this equation gives the following bound:

$$\Pr[d'_i \le r] \le \exp\left(-\frac{2}{n^2}\,\frac{(d - rn)^2}{d}\right). \tag{31}$$

Finally, applying the union bound, we can prove the result as follows:


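$$
\begin{aligned}
\Pr[d' \ge r + 1] &= \Pr\Big[\bigcap_{i=1}^{q'} \{d'_i \ge r + 1\}\Big] \\
&= 1 - \Pr\Big[\overline{\bigcap_{i=1}^{q'} \{d'_i \ge r + 1\}}\Big] \\
&= 1 - \Pr\Big[\bigcup_{i=1}^{q'} \{d'_i \le r\}\Big] \\
&\ge 1 - \sum_{i=1}^{q'} \Pr[d'_i \le r] \\
&\ge 1 - \sum_{i=1}^{q'} \exp\left(-\frac{2}{n^2}\,\frac{(d - rn)^2}{d}\right) \\
&= 1 - q' \exp\left(-\frac{2}{n^2}\,\frac{(d - rn)^2}{d}\right),
\end{aligned}
\tag{32}
$$

where the union bound is applied in the first inequality and the second inequality follows from (31). □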
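The tail bound (31) and the resulting guarantee (32) can be compared against the exact hypergeometric probability on small instances. The sketch below is illustrative only: the function names and the parameter triples are assumptions, and each instance takes $p' = p/n$ with $n$ dividing $p$.

```python
from math import comb, exp

def hypergeom_cdf(r, p, p_sub, d):
    """Exact Pr[d'_i <= r]: at most r of d draws (without replacement from p items)
    land in a fixed subset of size p_sub."""
    return sum(comb(p_sub, j) * comb(p - p_sub, d - j) for j in range(r + 1)) / comb(p, d)

def tail_bound(r, n, d):
    """Bound (31): exp(-(2/n^2) * (d - r*n)^2 / d), stated for d > r*n."""
    return exp(-2.0 * (d - r * n) ** 2 / (n ** 2 * d))

def min_degree_lower_bound(q_sub, r, n, d):
    """Bound (32): Pr[d' >= r+1] >= 1 - q' * exp(-(2/n^2) * (d - r*n)^2 / d)."""
    return 1.0 - q_sub * tail_bound(r, n, d)

# (p, n, d, r) instances with p' = p/n; values chosen only for illustration.
cases = [(300, 3, 60, 2), (100, 2, 20, 3), (90, 3, 30, 5)]
for p, n, d, r in cases:
    print(p, n, d, r, hypergeom_cdf(r, p, p // n, d), tail_bound(r, n, d))
```

In each case the exact probability sits below the exponential bound, and multiplying the bound by $q'$ gives the union-bound guarantee of the lemma.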


Note that the stricter degree condition 5 proposed in Section 3.2 implies that the probability bound in the above lemma goes to one at a rate inversely polynomial in $q$. Therefore, the lower bound on the degree in the above lemma holds whp.


A  lower  bound  on  the  Kruskal  rank  of  matrix  A  based  on  a  sufficient  relaxed  expansion  property 
on  A  is  provided  in  the  following  lemma. 

Lemma 9. If $A$ is generic and the bipartite graph $G(Y, X; A)$ satisfies the relaxed$^{13}$ expansion property $|N(S)| \ge |S|$ for any subset $S \subseteq Y$ with $|S| \le r$, then $\mathrm{krank}(A) \ge r$, almost surely.

$^{13}$ There is no $d_{\max}$ term, in contrast to the expansion property proposed in condition 7.
Before presenting the proof, we state the marriage theorem (Hall's theorem), which gives an equivalent condition for having a perfect matching in a bipartite graph.

Theorem 7 (Hall's theorem, [43]). A bipartite graph $G(Y, X; E)$ has a $Y$-saturating matching if and only if for every subset $S \subseteq Y$, the size of the set of neighbors of $S$ is at least as large as $S$, i.e.,

$$|N(S)| \ge |S|.$$

Proof of Lemma 9: Denote the submatrix $A_{N(S),S}$ by $A_S$, i.e., $A_S := A_{N(S),S}$. By Hall's theorem, the bipartite graph $G(S, N(S); A_S)$ has a perfect matching $M_S$ for any subset $S \subseteq Y$ such that $|S| \le r$. Denote by $A_{M_S}$ the matrix corresponding to this perfect matching edge set $M_S$, i.e., $A_{M_S}$ keeps the non-zero entries of $A_S$ on edge set $M_S$ and is zero everywhere else. Note that the support of $A_{M_S}$ is within the support of $A_S$. According to the definition of a perfect matching, the matrix $A_{M_S}$ is full column rank. From Lemma 6, it is concluded that $A_S$ is also full column rank almost surely. This is true for any $A_S$ with $S \subseteq Y$ and $|S| \le r$, which directly implies that $\mathrm{krank}(A) \ge r$, almost surely. □
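Lemma 9 can be illustrated by brute force on a small sparse matrix. The sketch below checks the relaxed expansion property on a hypothetical support pattern with $p = 4$ rows and $q = 5$ columns, fills the support with distinct primes as a stand-in for generic entries (an assumption made only so the arithmetic is exact), and computes the Kruskal rank directly. The helper names are illustrative and do not come from the paper.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Exact rank of a matrix (list of rows) via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            if factor != 0:
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def expansion_holds(neighbors, r):
    """Check |N(S)| >= |S| for every column subset S with 1 <= |S| <= r."""
    cols = range(len(neighbors))
    return all(
        len(set().union(*(neighbors[j] for j in S))) >= len(S)
        for size in range(1, r + 1)
        for S in combinations(cols, size)
    )

def kruskal_rank(A):
    """Largest k such that every k columns of A are linearly independent (brute force)."""
    q = len(A[0])
    k = 0
    for size in range(1, q + 1):
        if all(rank([[row[j] for j in S] for row in A]) == size
               for S in combinations(range(q), size)):
            k = size
        else:
            break
    return k

# Support pattern: neighbors[j] is the set of rows where column j is non-zero.
neighbors = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {0, 2}]
p = 4
primes = iter([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])  # stand-in for generic entries
A = [[0] * len(neighbors) for _ in range(p)]
for j, rows_j in enumerate(neighbors):
    for i in sorted(rows_j):
        A[i][j] = next(primes)

# Same support with all non-zero entries equal to 1: a non-generic choice.
B = [[1 if x else 0 for x in row] for row in A]
```

On this pattern the expansion property holds up to $r = 4$ and the prime-valued matrix attains Kruskal rank 4, while the all-ones matrix on the same support does not, which is why the lemma requires generic entries.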

Finally, Theorem 2 is proved by exploiting the random results on the existence of a perfect $n$-gram matching and on the Kruskal rank, provided in Theorems 4 and 5.

Proof  of  Theorem  2:  It  is  shown  that  if  random  conditions  4  and  5  are  satisfied  then  deterministic 
conditions  2  and  3  also  hold.  Then  Theorem  1  can  be  applied  and  the  proof  is  done. 

According  to  the  theorem  assumptions,  size  and  degree  conditions  required  for  Theorem  4  hold 
and  therefore  by  applying  this  theorem,  the  perfect  n-gram  matching  condition  2  is  satisfied  whp. 
The  conditions  required  for  Theorem  5  also  hold  and  by  applying  this  theorem  we  have  the  bound 
krank(A)  >  -p,  whp.  Combining  this  inequality  with  the  upper  bound  on  degree  d  in  condition 
5,  concludes  that  krank  condition  3  is  also  satisfied  whp.  Hence,  all  the  conditions  required  for 
Theorem  1  are  satisfied  whp,  and  this  completes  the  proof.  □ 